Abstract: Upper and lower bounds are obtained for the joint
entropy of a collection of random variables in terms of an 
arbitrary collection of subset joint entropies. These inequalities 
generalize Shannon's chain rule for entropy as well as inequalities 
of Han, Fujishige and Shearer. A duality between the upper 
and lower bounds for joint entropy is developed. All of these 
results are shown to be special cases of general, new results for 
submodular functions; thus, the inequalities presented constitute
a richly structured class of Shannon-type inequalities. The new 
inequalities are applied to obtain new results in combinatorics, 
such as bounds on the number of independent sets in an arbitrary 
graph and the number of zero-error source-channel codes, as 
well as new determinantal inequalities in matrix theory. A new 
inequality for relative entropies is also developed, along with 
interpretations in terms of hypothesis testing. Finally, revealing 
connections of the results to literature in economics, computer 
science, and physics are explored. 

Index Terms — Entropy inequality; inequality for minors; 
entropy-based counting; submodularity. 



I. Introduction 

Let $X_1, X_2, \ldots, X_n$ be a collection of random variables.
There are two familiar canonical cases: (a) the
random variables are real-valued and possess a probability
density function, in which case $h$ represents the differential
entropy, or (b) they are discrete, in which case $H$ represents the
discrete entropy. More generally, if the joint distribution has a
density $f$ with respect to some reference product measure, the
joint entropy may be defined by $-E[\log f(X_1, X_2, \ldots, X_n)]$;
with this definition, $H$ corresponds to counting measure and $h$
to Lebesgue measure. The only assumption we will implicitly
make throughout is that the joint entropy is finite, i.e., neither
$-\infty$ nor $+\infty$.

We wish to discuss the relationship between the joint entropies of various subsets of the random variables $X_1, X_2, \ldots, X_n$. Thus we are motivated to consider an arbitrary collection $\mathcal{C}$ of subsets of $\{1, 2, \ldots, n\}$. The following conventions are useful:

• $[n]$ is the index set $\{1, 2, \ldots, n\}$. We equip this set with its natural (increasing) order, so that $1 < 2 < \cdots < n$. (Any other total order would do equally well, and indeed we use this flexibility later, but it is convenient to fix a default order.)

• For any set $s \subset [n]$, $X_s$ stands for the collection of random variables $(X_i : i \in s)$, with the indices taken in their increasing order.

• For any index $i$ in $[n]$, define the degree of $i$ in $\mathcal{C}$ as $r(i) = |\{t \in \mathcal{C} : i \in t\}|$. Let $r_-(s) = \min_{i \in s} r(i)$ denote the minimal degree in $s$, and $r_+(s) = \max_{i \in s} r(i)$ denote the maximal degree in $s$.

Material in this paper was presented at the Information Theory and Applications Workshop, San Diego, CA, January 2007, and at the IEEE Symposium on Information Theory, Nice, France, June 2007.

Mokshay Madiman is with the Department of Statistics, Yale University, 24 Hillhouse Avenue, New Haven, CT 06511, USA. Email: mokshay.madiman@yale.edu

Prasad Tetali is with the School of Mathematics and College of Computing, Georgia Institute of Technology, Atlanta, GA 30332, USA. Email: tetali@math.gatech.edu. Supported in part by NSF grants DMS-0401239 and DMS-0701043.

First we present a weak form of our main inequality. 

Proposition I: [Weak degree form] Let $X_1, \ldots, X_n$ be arbitrary random variables jointly distributed on some discrete sets. For any collection $\mathcal{C}$ such that each index $i$ has non-zero degree,

$$\sum_{s \in \mathcal{C}} \frac{H(X_s | X_{s^c})}{r_+(s)} \le H(X_{[n]}) \le \sum_{s \in \mathcal{C}} \frac{H(X_s)}{r_-(s)}, \tag{1}$$

where $r_+(s)$ and $r_-(s)$ are the maximal and minimal degrees in $s$. If $\mathcal{C}$ satisfies $r_-(s) = r_+(s)$ for each $s$ in $\mathcal{C}$, then (1) also holds for $h$ in the setting of continuous random variables.



Proposition I unifies a large number of inequalities in the 
literature. Indeed, 

1) Applying to the class $\mathcal{C}_1$ of singletons,

$$\sum_{i=1}^{n} H(X_i | X_{[n] \setminus i}) \le H(X_{[n]}) \le \sum_{i=1}^{n} H(X_i). \tag{2}$$

The upper bound represents the subadditivity of entropy noticed by Shannon. The lower bound may be interpreted as the fact that the erasure entropy of a collection of random variables is not greater than their entropy; see Section VII for further comments.

2) Applying to the class $\mathcal{C}_{n-1}$ of all sets of $n - 1$ elements,

$$\frac{1}{n-1} \sum_{i=1}^{n} H(X_{[n] \setminus i} | X_i) \le H(X_{[n]}) \le \frac{1}{n-1} \sum_{i=1}^{n} H(X_{[n] \setminus i}). \tag{3}$$

This is Han's inequality [23], [10], in its prototypical form.

3) Let $r_- = \min_{i \in [n]} r(i)$ and $r_+ = \max_{i \in [n]} r(i)$ be the minimal and maximal degrees with respect to $\mathcal{C}$. Using $r_- \le r_-(s)$ and $r_+ \ge r_+(s)$, we have

$$\frac{1}{r_+} \sum_{s \in \mathcal{C}} H(X_s | X_{s^c}) \le H(X_{[n]}) \le \frac{1}{r_-} \sum_{s \in \mathcal{C}} H(X_s).$$

The upper bound is Shearer's lemma [9], known in the combinatorics literature [43]. The lower bound is new.
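The weak degree form can be checked numerically on a small example. The following is a minimal sketch, assuming a randomly generated joint distribution on four binary variables (indexed 0 to 3 rather than 1 to 4) and a hypothetical collection chosen only for illustration; the helper names `entropy` and `weak_degree_bounds` are not from the paper.

```python
import itertools
import math
import random


def entropy(pmf, subset):
    """Shannon entropy (in bits) of the marginal of pmf on the coordinates in subset."""
    marginal = {}
    for outcome, p in pmf.items():
        key = tuple(outcome[i] for i in subset)
        marginal[key] = marginal.get(key, 0.0) + p
    return -sum(p * math.log2(p) for p in marginal.values() if p > 0)


def weak_degree_bounds(pmf, n, collection):
    """Return (lower, joint, upper) as in the weak degree form, indices 0..n-1."""
    full = tuple(range(n))
    degree = {i: sum(1 for s in collection if i in s) for i in full}
    joint = entropy(pmf, full)
    lower = upper = 0.0
    for s in collection:
        complement = tuple(i for i in full if i not in s)
        conditional = joint - entropy(pmf, complement)  # H(X_s | X_{s^c})
        lower += conditional / max(degree[i] for i in s)  # divide by r_+(s)
        upper += entropy(pmf, s) / min(degree[i] for i in s)  # divide by r_-(s)
    return lower, joint, upper


# Illustrative random joint distribution on {0,1}^4 (an assumption, not from the paper).
random.seed(0)
n = 4
outcomes = list(itertools.product([0, 1], repeat=n))
weights = [random.random() for _ in outcomes]
total = sum(weights)
pmf = {x: w / total for x, w in zip(outcomes, weights)}

# Hypothetical collection in which every index has non-zero degree.
collection = [(0, 1), (1, 2), (2, 3), (0, 3), (1,)]
lower, joint, upper = weak_degree_bounds(pmf, n, collection)
print(lower, joint, upper)
```

The printed triple should be non-decreasing from left to right, whatever distribution and collection are used, as long as each index is covered.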

The paper is organized as follows. First, in Section II, the notions of fractional coverings and packings using hypergraphs, which provide a useful language for the information inequalities we present, are developed. In Section III we present the main technical result of this paper, which is a new inequality for submodular set functions. Section IV presents the main entropy inequality of this paper, which strengthens Proposition I, and gives a very simple proof as a corollary of the general result for submodular functions. This entropy inequality is developed in two forms, which we call the strong fractional form and the strong degree form; Proposition I may then be thought of as the weak degree form. A different manifestation of the upper bound in this weak degree form of the inequality was recently proved (in a more involved manner) by Friedgut [15]; the relationship with his result is also further discussed in Section IV using the preliminary concepts developed in Section II.

While independent sets in graphs have always been of 
combinatorial and graph-theoretical interest, counting inde- 
pendent sets in bipartite graphs received renewed attention 
due to Kahn's entropy approach [26] to Dedekind's prob- 
lem. Dedekind's problem involves counting the number of 
antichains in the Boolean lattice, or equivalently, counting 
the number of Boolean functions on n variables that can be 
constructed using only AND and OR (and no NOT) gates. To 
handle this problem by induction on the number of levels in the 
lattice, Kahn first derived a tight bound on the (logarithm of 
the) number of independent sets in a regular bipartite graph. 
In Section V we build on Kahn's work to obtain a bound
on the number of independent sets in an arbitrary graph. We
also generalize this to counting graph homomorphisms, with 
applications to graph coloring and zero-error source-channel 
codes. 

The applications of entropy inequalities to counting typically
involve discrete random variables, but the inequalities
also have applications when applied to continuous random 
variables. In Section VI we develop such an application by 
proving a new family of determinantal inequalities that provide 
generalizations of the classical determinantal inequalities of 
Hadamard, Szasz and Fischer. 

Having presented two applications of our main inequalities, 
we move on to studying the structure of the inequalities more 
closely. In Section VII we present a duality between our upper
and lower bounds that generalizes a theorem of Fujishige [17].
In particular, we show that the collection of upper bounds on
$H(X_{[n]})$ for all collections $\mathcal{C}$ is equivalent to the collection
of lower bounds. There we also discuss interpretations of the 
inequality relating to sensor networks and erasure entropy, and 
generalize the monotonicity property for special collections of 
subsets discovered by Han [23]. 



Section VIII presents some new entropy power inequalities for joint distributions, and points out an intriguing analogy between them and the recent subset sum entropy power inequalities of Madiman and Barron [33]. In Section IX we prove inequalities for relative entropy between joint distributions. Interpretations of the relative entropy inequality through hypothesis testing and concentration of measure are also given there.

In Section X we note that weaker versions of our main inequality for submodular functions follow from results developed in various communities (economics, computer science, physics); this history and the consequent connections do not seem to be well known or much tapped in information theory. Finally, in Section XI we conclude with some final remarks and a brief discussion of other applications, including to multiuser information theory.

II. On Hypergraphs and Related Concepts 

It is appropriate here to recall some terminology from 
discrete mathematics. A collection C of subsets of [n] is called 
a hypergraph, and each set s in C is called a hyperedge. 
When each hyperedge has cardinality 2, then C can be thought 
of as the set of edges of an undirected graph on n labelled 
vertices. Thus all the statements made above can be translated 
into the language of hypergraphs. In the rest of this paper, 
we interchangeably use "hypergraph" and "collection" for C, 
"hyperedge" and "set" for s in C, and "vertex" and "index" 
for i in [n]. 

We have the following standard definitions. 

Definition I: The collection C is said to be r-regular if each 
index i in [n] has the same degree r, i.e., if each vertex i 
appears in exactly r hyperedges of C. 

The following definitions extend the familiar notion of pack- 
ings, coverings and partitions of sets by allowing fractional 
counts. The history of these notions is unclear to us, but 
some references can be found in the book by Scheinerman 
and Ullman [44]. 

Definition II: Given a collection $\mathcal{C}$ of subsets of $[n]$, a function $\alpha : \mathcal{C} \to \mathbb{R}_+$ is called a fractional covering if for each $i \in [n]$, we have $\sum_{s \in \mathcal{C} : i \in s} \alpha(s) \ge 1$.

Given $\mathcal{C}$, a function $\beta : \mathcal{C} \to \mathbb{R}_+$ is a fractional packing if for each $i \in [n]$, we have $\sum_{s \in \mathcal{C} : i \in s} \beta(s) \le 1$.

If $\gamma : \mathcal{C} \to \mathbb{R}_+$ is both a fractional covering and a fractional packing, we call $\gamma$ a fractional partition.

Note that the standard definition of a fractional packing of $[n]$ using $\mathcal{C}$ (as in [44]) would assign weights $\beta_i$ to the elements $i \in [n]$ (rather than to the sets), and require that, for each $s \in \mathcal{C}$, we have $\sum_{i \in s} \beta_i \le 1$. Our terminology can be justified if one considers the "dual hypergraph," obtained by interchanging the roles of elements and sets: consider the 0-1 incidence matrix (with rows indexed by the elements and columns by the sets) of the set system, and simply switch the roles of the elements and the sets.

The following simple lemmas are useful. 

Lemma I: [Fractional Additivity] Let $\{a_i : i \in [n]\}$ be an arbitrary collection of real numbers. For any $s \subset [n]$, define $a_s = \sum_{j \in s} a_j$. For any fractional partition $\gamma$ using any hypergraph $\mathcal{C}$, $a_{[n]} = \sum_{s \in \mathcal{C}} \gamma(s) a_s$. Furthermore, if each $a_i \ge 0$, then

$$\sum_{s \in \mathcal{C}} \beta(s) a_s \le a_{[n]} \le \sum_{s \in \mathcal{C}} \alpha(s) a_s \tag{4}$$

for any fractional packing $\beta$ and any fractional covering $\alpha$ using $\mathcal{C}$.

Proof: Interchanging sums implies

$$\sum_{s \in \mathcal{C}} \alpha(s) \sum_{i \in s} a_i = \sum_{i \in [n]} a_i \sum_{s \in \mathcal{C}} \alpha(s) 1_{\{i \in s\}} \ge \sum_{i \in [n]} a_i,$$

using the definition of a fractional covering. The other statements are similarly obvious. ■
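Fractional additivity is easy to verify on a toy instance. The sketch below assumes a hypothetical ground set of four indices (labelled 0 to 3), non-negative values $a_i$ chosen for illustration, and the collection of all 2-element subsets; the weights 1/3, 1/2 and 1/4 are illustrative choices that give a fractional partition, covering and packing respectively.

```python
from fractions import Fraction
from itertools import combinations


def weighted_sum(weights, values):
    """Sum over sets s of weights[s] * a_s, where a_s is the sum of values[i] for i in s."""
    return sum(w * sum(values[i] for i in s) for s, w in weights.items())


def cover_sums(weights, n):
    """For each index i, the total weight of the sets that contain i."""
    return [sum(w for s, w in weights.items() if i in s) for i in range(n)]


n = 4
values = [Fraction(3), Fraction(1), Fraction(4), Fraction(2)]  # non-negative a_i
total = sum(values)  # a_[n]

# Every index lies in exactly n - 1 = 3 of the 2-element subsets.
pairs = list(combinations(range(n), 2))
partition = {s: Fraction(1, n - 1) for s in pairs}  # per-index sums equal 1
covering = {s: Fraction(1, 2) for s in pairs}       # per-index sums equal 3/2
packing = {s: Fraction(1, 4) for s in pairs}        # per-index sums equal 3/4

print(weighted_sum(packing, values), total, weighted_sum(covering, values))
print(weighted_sum(partition, values) == total)
```

Exact rational arithmetic is used so that the equality for the fractional partition is checked without rounding error.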
We introduce the notion of quasiregular hypergraphs. 

Definition III: The hypergraph $\mathcal{C}$ is quasiregular if the degree function $r : [n] \to \mathbb{Z}_+$ defined by $r(i) = |\{s \in \mathcal{C} : s \ni i\}|$ is constant on $s$, for each $s$ in $\mathcal{C}$.

Example: One can construct simple examples of quasiregular hypergraphs using what are called bi-regular graphs in the graph theory literature. Consider a bipartite graph on vertex sets $V_1$ and $V_2$ (i.e., all edges go between $V_1$ and $V_2$), such that every vertex in $V_1$ has degree $r_1$ and every vertex in $V_2$ has degree $r_2$. Such a graph always exists if $|V_1| r_1 = |V_2| r_2$. Now consider the hypergraph on $V_1 \cup V_2$ with hyperedges being the neighborhoods of vertices in the bipartite graph. This hypergraph is quasiregular (with degrees being $r_1$ and $r_2$), and it is not regular if $r_1$ is different from $r_2$.
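As a concrete instance of this construction, one may take the bipartite incidence graph between the four points of a set and its six 2-element subsets, so that $r_1 = 2$ and $r_2 = 3$. This particular graph, the vertex labels, and the helper names below are illustrative assumptions, not taken from the paper.

```python
from itertools import combinations


def neighborhoods(edges):
    """Map each vertex of a graph to the set of its neighbors."""
    nbrs = {}
    for u, v in edges:
        nbrs.setdefault(u, set()).add(v)
        nbrs.setdefault(v, set()).add(u)
    return nbrs


def degrees(hyperedges, vertices):
    """Number of hyperedges containing each vertex."""
    return {i: sum(1 for s in hyperedges if i in s) for i in vertices}


def is_quasiregular(hyperedges, vertices):
    """True if the degree function is constant on every hyperedge."""
    deg = degrees(hyperedges, vertices)
    return all(len({deg[i] for i in s}) == 1 for s in hyperedges)


def is_regular(hyperedges, vertices):
    """True if all vertices have the same degree."""
    return len(set(degrees(hyperedges, vertices).values())) == 1


# V_2: four points labelled 0..3. V_1: the six pairs of points, labelled 4..9.
points = [0, 1, 2, 3]
pairs = list(combinations(points, 2))
pair_label = {pair: 4 + k for k, pair in enumerate(pairs)}

# Each pair-vertex is joined to its two points (r_1 = 2); each point lies in three pairs (r_2 = 3).
edges = [(pair_label[pair], point) for pair in pairs for point in pair]

nbrs = neighborhoods(edges)
vertices = sorted(nbrs)
hyperedges = [frozenset(nbrs[v]) for v in vertices]

print(is_quasiregular(hyperedges, vertices), is_regular(hyperedges, vertices))
```

Here $|V_1| r_1 = 6 \cdot 2 = 12 = 4 \cdot 3 = |V_2| r_2$, so the existence condition quoted in the example is met.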

There is a sense in which all quasiregular hypergraphs are 
similar to the example above; specifically, any quasiregular 
hypergraph has a canonical decomposition as a disjoint union 
of regular sub hypergraphs. 

Lemma II: Suppose the hypergraph $\mathcal{C}$ on the vertex set $[n]$ is quasiregular. Then one can partition $[n]$ into disjoint subsets $\{V_m\}$, and $\mathcal{C}$ into disjoint subhypergraphs $\{\mathcal{C}_m\}$ such that each $\mathcal{C}_m$ is a regular hypergraph on vertex set $V_m$.

Proof: Consider the equivalence relation on $[n]$ induced by the degree, i.e., $i$ and $j$ are related if $r(i) = r(j)$. This relation decomposes $[n]$ into disjoint equivalence classes $\{V_m\}$. Since $\mathcal{C}$ is quasiregular, all indices in $s$ have the same degree for each set $s \in \mathcal{C}$, and hence each $s \in \mathcal{C}$ is a subset of exactly one equivalence class $V_m$. Taking $\mathcal{C}_m$ to be the collection of those sets of $\mathcal{C}$ that are contained in $V_m$ gives the claimed decomposition. ■

The notion of quasiregularity is related to what we believe is an important and natural fractional covering/packing pair. As long as there is at least one set $s$ in the hypergraph $\mathcal{C}$ that contains $i$, we have

$$\sum_{s \in \mathcal{C}, s \ni i} \frac{1}{r_-(s)} \ge \sum_{s \in \mathcal{C}, s \ni i} \frac{1}{r(i)} = 1,$$

so that the numbers $\alpha(s) = \frac{1}{r_-(s)}$ provide a fractional covering. Similarly, the numbers $\beta(s) = \frac{1}{r_+(s)}$ provide a fractional packing.

Definition IV: Let $\mathcal{C}$ be any hypergraph on $[n]$ such that every index appears in at least one hyperedge. The fractional covering given by $\alpha(s) = \frac{1}{r_-(s)}$ is called the degree covering, and the fractional packing given by $\beta(s) = \frac{1}{r_+(s)}$ is called the degree packing.

The following lemma is a trivial consequence of the defini- 
tions. 

Lemma III: If $\mathcal{C}$ is quasiregular, the degree packing and degree covering coincide and provide a fractional partition of $[n]$ using $\mathcal{C}$. In particular, $a_{[n]} = \sum_{s \in \mathcal{C}} a_s / r_-(s)$.

One may define the weight of a fractional partition as 
follows. 

Definition V: Let $\gamma$ be a fractional partition (or a fractional covering or packing). Then the weight of $\gamma$ is $w(\gamma) = \sum_{s \in \mathcal{C}} \gamma(s)$.

There are natural optimization problems associated with the weight function. The problem of minimizing the weight of $\alpha$ over all fractional coverings $\alpha$ is called the optimal fractional covering problem, and that of maximizing the weight of $\beta$ over all fractional packings $\beta$ is called the optimal fractional packing problem. These are linear programming relaxations of the integer programs associated with optimal covering and optimal packing, which are of course important in many applications. Much work has been done on these problems, including studies of the integrality gap (see, e.g., [44]).

One may also define a notion of duality for fractional 
partitions. 

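Definition VI: For any hypergraph $\mathcal{C}$, define the complementary hypergraph as $\bar{\mathcal{C}} = \{s^c : s \in \mathcal{C}\}$. If $\alpha$ is a fractional covering (or packing) using $\mathcal{C}$, the dual fractional packing (respectively, covering) using $\bar{\mathcal{C}}$ is defined by

$$\bar{\alpha}(s^c) = \frac{\alpha(s)}{w(\alpha) - 1}.$$

To see that this definition makes sense (say for the case of a fractional covering $\alpha$), note that for each $i \in [n]$,

$$\begin{aligned}
\sum_{s^c \in \bar{\mathcal{C}}, s^c \ni i} \bar{\alpha}(s^c) &= \sum_{s \in \mathcal{C}, i \notin s} \frac{\alpha(s)}{w(\alpha) - 1} \\
&= \frac{\sum_{s \in \mathcal{C}} \alpha(s) - \sum_{s \in \mathcal{C}, i \in s} \alpha(s)}{w(\alpha) - 1} \\
&\le \frac{w(\alpha) - 1}{w(\alpha) - 1} = 1.
\end{aligned}$$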

III. A New Inequality for Submodular Functions

The following definitions are necessary in order to state the 
main technical result of this paper. 

Definition VII: The set function $f : 2^{[n]} \to \mathbb{R}$ is submodular if

$$f(s) + f(t) \ge f(s \cup t) + f(s \cap t)$$

for every $s, t \subset [n]$. If $-f$ is submodular, we say that $f$ is supermodular.

Definition VIII: For any disjoint subsets $s$ and $t$ of $[n]$, define $f(s|t) = f(s \cup t) - f(t)$. For a fixed subset $t \subset [n]$, the function $f_t : 2^{[n] \setminus t} \to \mathbb{R}$ defined by $f_t(s) = f(s|t)$ is called $f$ conditional on $t$.

For any $s \subset [n]$, denote by $<s$ the set of indices less than every index in $s$. Similarly, $>s$ is the set of indices greater than every index in $s$. Also, the index $i$ is identified with the set $\{i\}$; thus, for instance, $<i$ is well-defined. We also write $[i : i + k]$ for $\{i, i + 1, \ldots, i + k - 1, i + k\}$. Note that $[n] = [1 : n]$.



Lemma IV: Let $f : 2^{[n]} \to \mathbb{R}$ be any submodular function with $f(\phi) = 0$.

1) If $s, t, u$ are disjoint sets,

$$f(s | t, u) \le f(s | t). \tag{5}$$

2) The following "chain rule" expression holds for $f([n])$:

$$f([n]) = \sum_{i \in [n]} f(i | <i).$$


Proof: First note that if $s, t, u$ are disjoint sets, then submodularity implies

$$f(s \cup t \cup u) + f(t) \le f(s \cup t) + f(t \cup u),$$

which is equivalent to $f(s | t, u) \le f(s | t)$.

The "chain rule" expression for $f([n])$ is obtained by induction. Note that $f([2]) = f(1) + f(2|1) = f(1|\phi) + f(2|1)$ since $f(\phi) = 0$. Now assume the chain rule holds for $[n]$, and observe that

$$f([n+1]) = f([n]) + f(n+1 | [n]) = \sum_{i \in [n+1]} f(i | <i),$$

where we used the induction hypothesis for the second equality. ■
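Both parts of the lemma can be checked by brute force on a small submodular function. The sketch below assumes a coverage function (the number of items covered by a set of indices), which is a standard example of a submodular function with value 0 on the empty set; the data in `COVER` are hypothetical.

```python
from itertools import combinations

# Hypothetical ground data: index i "covers" the items in COVER[i].
COVER = {1: {"a", "b"}, 2: {"b", "c"}, 3: {"c", "d", "e"}, 4: {"a", "e"}}
N = len(COVER)


def f(s):
    """Coverage function: number of items covered by the indices in s."""
    covered = set()
    for i in s:
        covered |= COVER[i]
    return len(covered)


def cond(s, t):
    """Conditional value f(s | t) = f(s union t) - f(t), for disjoint s and t."""
    return f(set(s) | set(t)) - f(t)


def subsets(ground):
    """All subsets of ground, as frozensets."""
    ground = sorted(ground)
    for k in range(len(ground) + 1):
        for c in combinations(ground, k):
            yield frozenset(c)


ground = frozenset(COVER)

# Definition of submodularity, checked over all pairs of subsets.
is_submodular = all(
    f(s) + f(t) >= f(s | t) + f(s & t)
    for s in subsets(ground)
    for t in subsets(ground)
)

# Part 1 for singletons: enlarging the conditioning set from t to u cannot increase the value.
conditioning_reduces = all(
    cond({i}, u) <= cond({i}, t)
    for i in ground
    for u in subsets(ground - {i})
    for t in subsets(u)
)

# Part 2: chain rule, summing f(i | <i) over i = 1, ..., N.
chain = sum(cond({i}, range(1, i)) for i in range(1, N + 1))

print(is_submodular, conditioning_reduces, chain, f(ground))
```

For this data the chain rule terms are 2, 1, 2 and 0, which add up to $f([4]) = 5$.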

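Theorem I: Let $f : 2^{[n]} \to \mathbb{R}$ be any submodular function with $f(\phi) = 0$. Let $\gamma$ be any fractional partition with respect to any collection $\mathcal{C}$ of subsets of $[n]$. Then

$$\sum_{s \in \mathcal{C}} \gamma(s) f(s | s^c \setminus >s) \le f([n]) \le \sum_{s \in \mathcal{C}} \gamma(s) f(s | <s).$$

Proof: The chain rule (actually a slightly extended version of it with additional conditioning in all terms that can be proved in exactly the same way) implies

$$f(s | <s) = \sum_{j \in s} f(j | <j \cap s, <s). \tag{6}$$

Thus

$$\begin{aligned}
\sum_{s \in \mathcal{C}} \gamma(s) f(s | <s) &\overset{(a)}{=} \sum_{s \in \mathcal{C}} \gamma(s) \sum_{j \in s} f(j | <j \cap s, <s) \\
&\overset{(b)}{\ge} \sum_{s \in \mathcal{C}} \gamma(s) \sum_{j \in s} f(j | <j) \\
&\overset{(c)}{=} \sum_{j \in [n]} f(j | <j) \sum_{s \in \mathcal{C}} \gamma(s) 1_{\{j \in s\}} \\
&\overset{(d)}{\ge} \sum_{j \in [n]} f(j | <j) \\
&= f([n]),
\end{aligned}$$

where (a) follows by the chain rule (6), (b) follows from (5), (c) follows by interchanging sums, and (d) follows by the definition of a fractional covering (it holds with equality here, since $\gamma$ is a fractional partition).

The lower bound may be proved in a similar fashion by a chain of inequalities. Indeed,

$$\begin{aligned}
\sum_{s \in \mathcal{C}} \gamma(s) f(s | s^c \setminus >s) &\overset{(a)}{=} \sum_{s \in \mathcal{C}} \gamma(s) \sum_{j \in s} f(j | <j \cap s, s^c \setminus >s) \\
&\overset{(b)}{\le} \sum_{s \in \mathcal{C}} \gamma(s) \sum_{j \in s} f(j | <j) \\
&\overset{(c)}{=} \sum_{j \in [n]} f(j | <j) \sum_{s \in \mathcal{C}} \gamma(s) 1_{\{j \in s\}} \\
&\overset{(e)}{\le} \sum_{j \in [n]} f(j | <j) \\
&= f([n]),
\end{aligned}$$

where (a), (b), (c) follow as above, and (e) follows by the definition of a fractional partition. ■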
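The two bounds of the theorem can be evaluated exactly on a small example. The sketch below assumes the same kind of coverage function as a stand-in submodular function (hypothetical data in `COVER`), and uses the fractional partition that puts weight $1/(n-1)$ on every 2-element subset of $[4]$; it also computes the weaker upper bound without conditioning for comparison.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical ground data: index i "covers" the items in COVER[i].
COVER = {1: {"a", "b"}, 2: {"b", "c"}, 3: {"c", "d", "e"}, 4: {"a", "e"}}
N = len(COVER)
GROUND = frozenset(COVER)


def f(s):
    """Coverage function, submodular with f(empty set) = 0."""
    covered = set()
    for i in s:
        covered |= COVER[i]
    return len(covered)


def cond(s, t):
    """f(s | t) = f(s union t) - f(t), for disjoint s and t."""
    return f(set(s) | set(t)) - f(t)


def before(s):
    """The set <s of indices smaller than every index in s."""
    return frozenset(i for i in GROUND if i < min(s))


def after(s):
    """The set >s of indices larger than every index in s."""
    return frozenset(i for i in GROUND if i > max(s))


def bounds(gamma):
    """Lower bound, f([n]) and upper bound for a fractional partition gamma."""
    upper = sum(w * cond(s, before(s)) for s, w in gamma.items())
    lower = sum(w * cond(s, (GROUND - s) - after(s)) for s, w in gamma.items())
    return lower, f(GROUND), upper


# Each index lies in N - 1 of the pairs, so weight 1/(N-1) gives a fractional partition.
gamma = {frozenset(c): Fraction(1, N - 1) for c in combinations(sorted(GROUND), 2)}

lower, value, upper = bounds(gamma)
unconditioned = sum(w * f(s) for s, w in gamma.items())
print(lower, value, upper, unconditioned)
```

For this data the lower bound is 4, $f([4]) = 5$, the conditioned upper bound is 6, and the unconditioned upper bound is 23/3, which illustrates the gain from conditioning.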

Remark 1: The key new element in this result is the fact that one can use, for any ordering on the ground set $[n]$, the conditional values of $f$ that appear in the upper and lower bounds for $f([n])$. Because of (5), this is an improvement over simply using $f$. The latter weaker inequality has been implicit in the cooperative game theory literature; various historical remarks explicating these connections are given in Section X.



Corollary I: Let $f : 2^{[n]} \to \mathbb{R}$ be any submodular function with $f(\phi) = 0$, such that $f([j])$ is non-decreasing in $j$ for $j \in [n]$. Then, for any collection $\mathcal{C}$ of subsets of $[n]$,

$$\sum_{s \in \mathcal{C}} \beta(s) f(s | s^c \setminus >s) \le f([n]) \le \sum_{s \in \mathcal{C}} \alpha(s) f(s | <s),$$

where $\beta$ is any fractional packing and $\alpha$ is any fractional covering of $\mathcal{C}$.

Proof: The proof is almost exactly the same as that of Theorem I; the only difference being that the validity there of (d) for fractional coverings and of (e) for fractional packings is guaranteed by the non-negativity of $f(j | <j)$. ■

Observe that if $f$ defines a polymatroid (i.e., $f$ is not only submodular but also non-decreasing in the sense that $f(s) \le f(t)$ if $s \subset t$), then the condition of Corollary I is automatically satisfied.

IV. Entropy Inequalities 

A. Strong Fractional Form 

The main entropy inequality introduced in this work is the 
following generalization of Shannon's chain rule. 

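Theorem I': [Strong fractional form] For any collection $\mathcal{C}$ of subsets of $[n]$,

$$\sum_{s \in \mathcal{C}} \beta(s) H(X_s | X_{s^c \setminus >s}) \le H(X_{[n]}) \le \sum_{s \in \mathcal{C}} \alpha(s) H(X_s | X_{<s})$$

and

$$\sum_{s \in \mathcal{C}} \gamma(s) h(X_s | X_{s^c \setminus >s}) \le h(X_{[n]}) \le \sum_{s \in \mathcal{C}} \gamma(s) h(X_s | X_{<s}),$$

where $\beta$ is any fractional packing, $\alpha$ is any fractional covering, and $\gamma$ is any fractional partition of $\mathcal{C}$.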

One can give an elementary proof of Theorem I' as a refinement of that given by Llewellyn and Radhakrishnan for Shearer's lemma (see [43]). However, instead of giving the proof in terms of entropy (which one may find in the conference paper [35]), we have proved in Theorem I a more general result that holds for the rather wide class of submodular set functions. To see that Theorem I' follows from Theorem I, we need to check that the joint entropy set function $f(s) = H(X_s)$ is a submodular function with $f(\phi) = 0$. The submodularity of $f$ is a well known result that to our knowledge was first explicitly mentioned by Fujishige [17], although he appears to partially attribute the result to a 1960 paper of Watanabe that we have been unable to find. It follows from the fact that $H(X_s) + H(X_t) - H(X_{s \cup t}) - H(X_{s \cap t}) = I(X_{s \setminus t} ; X_{t \setminus s} | X_{s \cap t})$ is a conditional mutual information (see, e.g., Cover and Thomas [10]), which is guaranteed to be non-negative by Jensen's inequality. To see that $f(\phi) = 0$ is the "correct" definition, note that the "unconditional" entropy $H(X_s)$ should be equal to $H(X_s | X_\phi)$, but the latter is $H(X_s) - H(X_\phi)$ by definition, which suggests that $H(X_\phi) = 0$.

Again, we would like to stress the freedom given by 
Theorem I' in terms of choice of ordering. For convenience of 
notation, we simply chose one labelling of the indices using 
the natural numbers and used the ordering 1 < 2 < . . . < n, 
but one may equally well use another labelling or ordering. 

Remark 2: It is natural to ask what choices of fractional packing and covering optimize the lower and upper bounds respectively. For a given collection of subset entropies, the optimal choices are clearly the solution of a linear program. Indeed, the best upper bound is obtained, for $w_s = H(X_s | X_{<s})$, by solving:

Minimize $\sum_{s \in \mathcal{C}} \alpha(s) w_s$

subject to $\alpha(s) \ge 0$ and $\sum_{s \in \mathcal{C}, s \ni i} \alpha(s) \ge 1$ for each $i \in [n]$.

When the subset entropies are all equal, this is just the problem of optimal fractional covering discussed in Section II.



B. Strong Degree Form 

The choice of $\alpha$ as the degree covering and $\beta$ as the degree packing in Theorem I' gives the strong degree form of the inequality.

Theorem II: [Strong Degree Form] Let $\mathcal{C}$ be any collection of subsets of $[n]$, such that every index $i$ appears in at least one element of $\mathcal{C}$. Then

$$\sum_{s \in \mathcal{C}} \frac{H(X_s | X_{s^c \setminus >s})}{r_+(s)} \le H(X_{[n]}) \le \sum_{s \in \mathcal{C}} \frac{H(X_s | X_{<s})}{r_-(s)}.$$

If $\mathcal{C}$ is quasiregular, then the above inequality also holds for $h$ in place of $H$.

Remark 3: This also proves Proposition I. Indeed, since conditioning reduces entropy, Proposition I is just the loose form of Theorem II obtained by dropping the conditioning on $<s$ in the upper bound, and including conditioning on $>s$ in the lower bound.

Remark 4: The collections C for which the results in this 
paper hold need not consist of distinct sets. That is, one may 
have multiple copies of a particular s C [n] contained in C, and 
as long as this is taken into account in counting the degrees 
of the indices (or checking that a set of coefficients forms 
a fractional packing or covering), the statements extend. We 
will make use of this feature when developing applications to
combinatorics in Section V.

Remark 5: Using the previous remark, one may write down 
Theorem II with arbitrary numbers of repetitions of each set in 
C. This gives a version of Theorem I' with rational coefficients, 
following which an approximation argument can be used to 
obtain Theorem I'. This proof is similar to the one alluded 
to by Friedgut [15] for the version without ordering. Thus 
Theorem II is actually equivalent to Theorem I'. 

The strong degree form of the inequality generalizes Shannon's chain rule. In order to see this, simply choose the collection $\mathcal{C}$ to be $\mathcal{C}_1$, the collection of all singletons. For this collection, Theorem II says

$$\sum_{i=1}^{n} H(X_i | X_{<i}) \le H(X_{[n]}) \le \sum_{i=1}^{n} H(X_i | X_{<i}),$$

which is precisely Shannon's chain rule (see, e.g., Shannon [45] and Cover and Thomas [10]), since the upper and lower bounds are identical. Note in contrast the looseness of the upper and lower bounds in (2), which are tight if and only if the random variables $X_i$ are independent.
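The contrast between the two forms for singletons can be seen numerically. The following is a minimal sketch, assuming a randomly generated (hence dependent) joint distribution on three binary variables indexed 0 to 2; the helper names are illustrative and not from the paper.

```python
import itertools
import math
import random


def entropy(pmf, subset):
    """Shannon entropy (in bits) of the marginal of pmf on the coordinates in subset."""
    marginal = {}
    for outcome, p in pmf.items():
        key = tuple(outcome[i] for i in subset)
        marginal[key] = marginal.get(key, 0.0) + p
    return -sum(p * math.log2(p) for p in marginal.values() if p > 0)


def cond_entropy(pmf, s, t):
    """H(X_s | X_t) = H(X_{s union t}) - H(X_t)."""
    both = tuple(sorted(set(s) | set(t)))
    return entropy(pmf, both) - entropy(pmf, tuple(sorted(t)))


# Illustrative random joint distribution on {0,1}^3 (an assumption, not from the paper).
random.seed(1)
n = 3
full = tuple(range(n))
outcomes = list(itertools.product([0, 1], repeat=n))
weights = [random.random() for _ in outcomes]
total = sum(weights)
pmf = {x: w / total for x, w in zip(outcomes, weights)}

joint = entropy(pmf, full)

# Strong form for singletons: both bounds equal the chain rule sum.
chain = sum(cond_entropy(pmf, (i,), range(i)) for i in range(n))

# Weak form for singletons: subadditivity above, erasure-entropy-type bound below.
weak_upper = sum(entropy(pmf, (i,)) for i in range(n))
weak_lower = sum(cond_entropy(pmf, (i,), [j for j in full if j != i]) for i in range(n))

print(weak_lower, chain, joint, weak_upper)
```

The chain rule sum agrees with the joint entropy up to rounding, while the two weak bounds bracket it with a gap whenever the variables are dependent.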

Application of Theorem II to non-symmetric collections is also of interest. For instance, choosing $\mathcal{C}$ to be the class of all sets of $k$ consecutive integers yields $r_- = 1$ and $r_+ = k$. Thus

$$H(X_{[n]}) \le \sum_{j \in [n]} H(X_{[j : l(j)]} | X_{<j}), \tag{7}$$

where $l(j) = \min\{j + k - 1, n\}$. These examples make it clear that Theorem II is rather powerful and generalizes well known results in addition to producing new ones.

C. Weak Fractional Form 

Theorems I' and II can be weakened by removing the 
conditioning in the upper bound, and adding conditioning in 
the lower bound; from the latter, one obtains the weak degree 
form of Proposition I, and from the former, one obtains the 
weak fractional form of our main inequality. 

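Proposition II: [Weak Fractional Form] For any hypergraph $\mathcal{C}$ on $[n]$,

$$\sum_{s \in \mathcal{C}} \beta(s) H(X_s | X_{s^c}) \le H(X_{[n]}) \le \sum_{s \in \mathcal{C}} \alpha(s) H(X_s) \tag{8}$$

and

$$\sum_{s \in \mathcal{C}} \gamma(s) h(X_s | X_{s^c}) \le h(X_{[n]}) \le \sum_{s \in \mathcal{C}} \gamma(s) h(X_s), \tag{9}$$

where $\beta$ is any fractional packing, $\alpha$ is any fractional covering, and $\gamma$ is any fractional partition of $\mathcal{C}$.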

Remark 6: While the main inequality as stated in both its 
degree form (Theorem II) and its fractional form (Theorem I') 
seems novel, the bounds have been known to various levels of 
generality, as pointed out in the Introduction. In the discrete 
mathematics community, particular forms of the upper bound 
have been well known ever since the introduction of Shearer's 
lemma by Chung, Graham, Frankl and Shearer [9] (see also 
Radhakrishnan [43] and Kahn [25]). In the level of generality 
of Proposition II, the fractional form was demonstrated by 
Friedgut [15] in terms of hypergraph projections. Friedgut's 
proof of the upper bound is perhaps not as transparent as the 
one we give. In the information theory community, both the 
upper and lower bounds of Proposition II have been known 
for the special case of the hypergraphs $\mathcal{C}_k$ (consisting of all
sets of k elements out of n), since the work of Han [23] and 
Fujishige [17]. In this paper, we unify and extend all of these 
results. 

Remark 7: In the case of independent random variables, the joint entropy $H(X_s) = H(X_s | X_{s^c}) = \sum_{i \in s} H(X_i)$ is additive. Thus in that case, for any quasiregular hypergraph $\mathcal{C}$, Proposition I holds with equality, and this is just Lemma III with $a_i = H(X_i)$. Similarly, thanks to Lemma I, Proposition II holds with equality for independent random variables when $\alpha = \beta$ is a fractional partition.

We believe that both the degree formulations of Proposition 
I and Theorem II, and the fractional formulations of Theorem 
I' and Proposition II are useful ways to think about these 
inequalities, and that they pave the way to the discovery of new 
applications. We illustrate this by using the degree formulation 
to count independent sets in graphs in Section V, and by
using the fractional formulation to obtain new determinantal
inequalities in Section VI.



V. An Application to Counting 

A. Entropy and Counting 

It is necessary to recall some terminology from graph theory. 
For our purposes, a graph $G = (V, E)$ consists of a finite
vertex set V and a collection E of two-element subsets of 
V called edges (allowing repetition, i.e., self-loops). Thus G 
is a special case of a hypergraph, each hyperedge having 
cardinality 2. Two vertices are said to be adjacent, if there 
is an edge containing both of them. An independent set of 
G is a subset Vj of V such that no two vertices in Vj are 
adjacent. 

Given a graph $F = (V(F), E(F))$, the set $\mathrm{Hom}(G, F)$ of homomorphisms from $G$ to $F$ is defined as

$$\mathrm{Hom}(G, F) = \{x : V \to V(F) \text{ s.t. } uv \in E \Rightarrow x(u)x(v) \in E(F)\}.$$

Let $K_{a,b}$ denote the complete bipartite graph between parts of sizes $a$ and $b$ respectively.
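For small graphs, the set of homomorphisms can be enumerated directly from this definition. The sketch below is a brute-force counter, assuming graphs are given as a vertex list and an edge list (a loop is an edge of the form `(a, a)`); the function name `count_homs` and the example graphs are illustrative. It uses two standard facts: homomorphisms into a single edge with a loop at one endpoint correspond to independent sets, and homomorphisms into the complete graph on three vertices correspond to proper 3-colorings.

```python
from itertools import product


def count_homs(g_vertices, g_edges, f_vertices, f_edges):
    """Count maps x from V(G) to V(F) that send every edge of G to an edge of F."""
    allowed = set()
    for a, b in f_edges:
        allowed.add((a, b))
        allowed.add((b, a))
    g_vertices = list(g_vertices)
    count = 0
    for image in product(f_vertices, repeat=len(g_vertices)):
        x = dict(zip(g_vertices, image))
        if all((x[u], x[v]) in allowed for u, v in g_edges):
            count += 1
    return count


# G: the 4-cycle.
cycle_vertices = [0, 1, 2, 3]
cycle_edges = [(0, 1), (1, 2), (2, 3), (3, 0)]

# F: two vertices joined by an edge, with a loop at "out" only.
# Vertices mapped to "in" must then form an independent set of G.
ind_vertices = ["out", "in"]
ind_edges = [("out", "in"), ("out", "out")]

# F: the triangle, whose homomorphisms are proper 3-colorings of G.
k3_vertices = [0, 1, 2]
k3_edges = [(0, 1), (1, 2), (0, 2)]

independent_sets = count_homs(cycle_vertices, cycle_edges, ind_vertices, ind_edges)
colorings = count_homs(cycle_vertices, cycle_edges, k3_vertices, k3_edges)
print(independent_sets, colorings)
```

The enumeration takes $|V(F)|^{|V|}$ steps, so it is only meant for checking small cases.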

Shearer's lemma, and more generally, entropy-based argu- 
ments, have proved very useful in combinatorics. Shearer's 
lemma was (implicitly) introduced by Chung, Graham, Frankl 
and Shearer [9], and Kahn [25] stated an extension using 
the more familiar entropy notation. Recent applications of 
Shearer's lemma to difficult problems (where counting bounds 
are a key step in obtaining the results) include Furedi [19], 
Friedgut and Kahn [16], Kahn [26], [25], Brightwell and Tetali 
[6], and Galvin and Tetali [21]. Radhakrishnan [43] provides 
a nice survey of entropy ideas used for counting and various 
applications; see also the book by Alon and Spencer [1]. 

The general strategy of entropy-based proofs in counting is 
as follows: 

• To count the number of objects in a certain class C of 
objects, consider a randomly drawn object X from the 
class and note that its entropy is H(X) = log \C\. 

• Represent X using a collection of discrete random vari- 
ables, and apply a Shearer-type lemma to bound H(X) 
using certain subset entropies for a clever choice of 
hypergraph dictated by the problem. 

• Perform an estimation of the resulting bound, using 
Jensen's inequality if necessary. 

Below, we follow this direction of work and demonstrate a 
counting application of the new inequality. In particular, we 
use Theorem I' to bound the number of independent sets of
an arbitrary graph, the number of proper graph colorings with 
a fixed number of colors, and more generally the number of 
graph homomorphisms. 

B. Counting graph homomorphisms 

Using Shearer's entropy inequality as a key ingredient, Kahn 
[27] recently showed a bound on the number of independent 
sets of a regular graph G, building on his earlier result 
[25] for bipartite, regular graphs. Kahn's proof extends in a 
straightforward way, as observed by D. Galvin [20], to also
provide an upper bound on the number of homomorphisms
from a d-regular graph G to an arbitrary graph F. Theorem III
below extends the observations of Kahn and Galvin to bound
the number of graph homomorphisms from an arbitrary graph
G to an arbitrary graph F.

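Theorem III: [Graph Homomorphisms] For any $N$-vertex graph $G$ and any graph $F$,

$$|\mathrm{Hom}(G, F)| \le \prod_{v \in V} |\mathrm{Hom}(K_{p(v),p(v)}, F)|^{1/d(v)}, \tag{10}$$

where $d(v)$ denotes the degree of $v$ in $G$ and $p(v)$ denotes the number of neighbors of $v$ preceding $v$ in any ordering induced by decreasing degrees.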

Proof: Let $X$ be chosen uniformly at random from $\mathrm{Hom}(G, F)$. The random homomorphism $X$ can be represented by the values it assigns to each $i \in V$, i.e., $X = (X(1), X(2), \ldots, X(N)) = (X_1, X_2, \ldots, X_N)$, where $X_i \in V(F)$. By definition, $X_i$ and $X_j$ are connected in $F$ if $i$ and $j$ are connected in $G$. We aim to bound $H(X)$ from above.

Let $\prec$ denote an ordering on vertices according to the decreasing order of their degrees (ties may be broken, for instance, by using an underlying lexicographic ordering of $V$). For each $i \in V$, let

$$P(i) = \{j \in V : \{i, j\} \in E \text{ and } j \prec i\},$$

and define $p(i) = |P(i)|$. Consider the collection $\mathcal{C}$ to be the collection of $P(i)$, and in addition, $p(i)$ copies of singleton sets $\{i\}$, for each $i$. Then observe that each $i$ is covered by $d(i)$ sets in $\mathcal{C}$, i.e., that the degree of $i$ in the collection $\mathcal{C}$ is $r(i) = d(i)$. Indeed, each $i$ appears in $d(i) - p(i)$ sets of the form $P(j)$, corresponding to each $j$ such that $i \prec j$ and $\{i, j\} \in E$, and once in each of the $p(i)$ singleton sets $\{i\}$.

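By the upper bound in Theorem II applied to this collection $\mathcal{C}$, we have

$$\begin{aligned}
H(X) &\le \sum_{i \in V} \left[ \frac{H(X_{P(i)} | X_{<P(i)})}{\min_{j \in P(i)} d(j)} + \frac{p(i)}{d(i)} H(X_i | X_{<i}) \right] \\
&\le \sum_{i \in V} \frac{1}{d(i)} \left[ H(X_{P(i)}) + p(i) H(X_i | X_{P(i)}) \right],
\end{aligned}$$

by relaxing the conditioning and by the fact that the chosen ordering makes $j \in P(i)$ imply $d(j) \ge d(i)$.

Let $q_i$ denote the probability mass function of $X_{P(i)}$, which takes its values in $\mathcal{X}_i = \{x_{P(i)} : x \in \mathrm{Hom}(G, F)\}$. In other words, $q_i(x_{P(i)})$ is the probability that $X_{P(i)} = x_{P(i)}$, under the uniform distribution on $X$. Finally, let $R_i(x_{P(i)})$ be the number of values that $X_i$ can take given that $X_{P(i)} = x_{P(i)}$, i.e., the support size of the conditional distribution of $X_i$ given $X_{P(i)} = x_{P(i)}$. Note that this is also the number of possible extensions of the partial homomorphism on $P(i)$ to a partial homomorphism on $P(i) \cup \{i\}$. Then

$$\begin{aligned}
H(X_{P(i)}) + p(i) H(X_i | X_{P(i)}) &= \sum_{x_{P(i)} \in \mathcal{X}_i} \left[ q_i(x_{P(i)}) \log \frac{1}{q_i(x_{P(i)})} + p(i) q_i(x_{P(i)}) H(X_i | X_{P(i)} = x_{P(i)}) \right] \\
&\le \sum_{x_{P(i)} \in \mathcal{X}_i} q_i(x_{P(i)}) \log \frac{R_i(x_{P(i)})^{p(i)}}{q_i(x_{P(i)})} \\
&\le \log \sum_{x_{P(i)} \in \mathcal{X}_i} R_i(x_{P(i)})^{p(i)},
\end{aligned}$$

where we have bounded $H(X_i | X_{P(i)} = x_{P(i)})$ by $\log R_i(x_{P(i)})$, and the last inequality follows by Jensen's inequality. Thus

$$H(X) \le \sum_{i \in V} \frac{1}{d(i)} \log \left( \sum_{x_{P(i)} \in \mathcal{X}_i} R_i(x_{P(i)})^{p(i)} \right).$$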

The proof is completed by observing that, for any $i \in V$,

$$\sum_{x_{P(i)} \in \mathcal{X}_i} R_i(x_{P(i)})^{p(i)} \le |\mathrm{Hom}(K_{p(i),p(i)}, F)|. \tag{11}$$

Indeed, first note that every (partial) homomorphism $x_{P(i)}$ of $P(i)$ for any graph $G$ (regardless of the ordering $\prec$) is trivially a valid (partial) homomorphism of one side of $K_{p(i),p(i)}$, since each side of this bipartite graph has no edges and $|P(i)| = p(i)$. Furthermore, for a valid $x_{P(i)}$, the number of extensions $R_i(x_{P(i)})$ to $i$ is the same whether the graph is $G$ or $K_{p(i),p(i)}$, since it only depends on $F$. This proves (11). Note that the inequality (11) can be strict, since there can be partial homomorphisms of one side of $K_{p(i),p(i)}$ to a given $F$ which are not valid when considering (partial) homomorphisms from $G$ to $F$, since the induced graph on $P(i)$, for a given $i$, might have some edges. (This corrects the claim in [21] that (11) holds with equality.) ■
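The bound (10) can be checked by brute force on a small graph. The following is a minimal sketch, not the authors' code: the helper names `symmetric`, `hom_count` and `theorem_bound` are hypothetical, vertices are labeled from 0, ties in the degree ordering are broken by label, and $G$ is assumed to have no isolated vertices so that every $d(v) > 0$.

```python
from itertools import product


def symmetric(pairs):
    """Turn a list of undirected edges (loops allowed) into a set of ordered pairs."""
    return {(a, b) for a, b in pairs} | {(b, a) for a, b in pairs}


def hom_count(n, edges, m, f_adj):
    """Count maps x: {0..n-1} -> {0..m-1} that send every edge of G into f_adj."""
    return sum(
        all((x[u], x[v]) in f_adj for u, v in edges)
        for x in product(range(m), repeat=n)
    )


def theorem_bound(n, edges, m, f_adj):
    """Right side of (10); assumes G has no isolated vertices, so d(v) > 0."""
    deg = [0] * n
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    # Decreasing-degree order, ties broken by vertex label.
    order = sorted(range(n), key=lambda v: (-deg[v], v))
    rank = {v: i for i, v in enumerate(order)}
    bound = 1.0
    for v in range(n):
        nbrs = [b if a == v else a for a, b in edges if v in (a, b)]
        p = sum(1 for u in nbrs if rank[u] < rank[v])
        # Complete bipartite graph K_{p,p} on vertices 0..2p-1.
        k_pp = [(i, p + j) for i in range(p) for j in range(p)]
        bound *= hom_count(2 * p, k_pp, m, f_adj) ** (1.0 / deg[v])
    return bound


c4 = [(0, 1), (1, 2), (2, 3), (3, 0)]      # G: the 4-cycle
k3 = symmetric([(0, 1), (0, 2), (1, 2)])   # F = K_3, so homomorphisms are proper 3-colorings
f_ind = symmetric([(0, 1), (0, 0)])        # F with a loop at vertex 0: independent sets

for name, f_adj, m in [("3-colorings", k3, 3), ("independent sets", f_ind, 2)]:
    print(name, hom_count(4, c4, m, f_adj), "<=", theorem_bound(4, c4, m, f_adj))
```

For the 4-cycle the exact counts are 18 and 7, against bounds of $\sqrt{648}$ and $\sqrt{63}$ respectively. The enumeration is exponential in the number of vertices, so this is only a sanity check for tiny instances.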

Nayak, Tuncel and Rose [42] note that zero-error source-channel codes are precisely graph homomorphisms from a "source confusability graph" $G_U$ to a "channel characteristic graph" $G_X$. Thus, Theorem III may also be interpreted as giving a bound on the number of zero-error source-channel codes that exist for a given source-channel pair.

C. Counting independent sets 

By choosing appropriate graphs F, various corollaries can 
be obtained. In particular, it is well known that the problem 
of counting independent sets in a graph can be cast in the 
language of graph homomorphisms. Choose F to be the graph 
on two vertices joined by an edge, and with a self-loop on one 
of the vertices. Then, by considering the set of vertices of G 
that are mapped to the un-looped vertex in F, it is easy to 
see that each homomorphism from G to F corresponds to an 
independent set of G. This yields the following corollary. 

Corollary II: [Independent Sets] Let $G = (V,E)$ be an arbitrary graph on $N$ vertices, and let $\mathcal{I}(G)$ denote the set of independent sets of $G$. Let $\prec$ denote an ordering on $V$ according to decreasing order of degrees of the vertices, breaking ties arbitrarily. Let $p(v)$ denote the number of neighbors of $v$ which precede $v$, under the $\prec$ ordering. Then

$$|\mathcal{I}(G)| \le \prod_{v \in V} 2^{(p(v)+1)/d(v)}.$$

Fig. 1. The graphs F relevant for counting independent sets and number of 5-colorings.

Specializing to the case of $d$-regular graphs $G = (V,E)$ on $n$ vertices, it is clear that

$$|\mathcal{I}(G)| \le \prod_{v \in V} 2^{(p_\sigma(v)+1)/d} \le 2^{\frac{n}{2} + \frac{n}{d}},$$

where $\prec_\sigma$ is an arbitrary total order on $V$, and $p_\sigma(v)$ is the number of vertices preceding $v$ in this order which are neighbors of $v$; the last step uses $\sum_{v \in V} p_\sigma(v) = |E| = nd/2$. This recovers Kahn's unpublished result [27] for $d$-regular graphs, which generalized his earlier result [25] for the $d$-regular, bipartite case. Note that we removed the assumption of regularity in Kahn's result by making a choice of ordering.

There is another way to view this result that is useful in computational geometry. Namely, if one considers a region (of, say, Euclidean space) and a finite family of subsets $\mathcal{F} = \{A_v : v \in V\}$ of this region, then one can define the intersection graph $G_{\mathcal{F}}$ of this family by connecting $i$ and $j$ in $V$ if and only if $A_i \cap A_j \ne \emptyset$. Then the independent sets of $G_{\mathcal{F}}$ are in one-to-one correspondence with packings of the region using sets in the family $\mathcal{F}$. Thus Corollary II also gives a bound on the number of packings of a region using a given family of sets.

Another easy corollary of Theorem III is to graph colorings. Recall that a (proper) $r$-coloring of the vertices of $G$ is a mapping $f : V \to [r]$ so that $u, v \in V$ and $uv \in E$ implies that $f(u) \ne f(v)$. Take the constraint graph to be $F = K_r$, a complete graph on $r$ vertices, for $r \ge 2$. Then $|\mathrm{Hom}(G, K_r)|$ is the number of (proper) $r$-colorings of the vertices of $G$. Thus the above theorem yields a corresponding upper bound on the number of $r$-colorings of a graph $G$, by replacing $|\mathrm{Hom}(K_{p(v),p(v)}, F)|$ in (10) with the number of $r$-colorings of the complete bipartite graph $K_{p(v),p(v)}$.

VI. An Application to Determinantal Inequalities 

The connection between determinants of positive definite 
matrices and multivariate normal distributions is classical. 
For example, Bellman's text [3] on matrix analysis makes 
extensive use of an "integral representation" of determinants 
in terms of an integrand of the form $e^{-\langle x, Ax \rangle}$, which is 
essentially the Gaussian density. The classical determinan- 
tal inequalities of Hadamard and Fischer then follow from 
the subadditivity of entropy. This approach seems to have 
been first cast in probabilistic language by Dembo, Cover 
and Thomas [11], who further showed that an inequality of 
Szasz can be derived (and generalized) using Han's inequality. 
Following this well-trodden path, Proposition II yields the 
following general determinantal inequality. 

Corollary III: [Determinantal Inequalities] Let $K$ be a positive definite $n \times n$ matrix and let $C$ be a hypergraph on $[n]$. Let $K(s)$ denote the submatrix corresponding to the rows and columns indexed by elements of $s$. Then, using $|M|$ to denote the determinant of $M$, we have for any fractional partition $\alpha^*$,

$$\prod_{s \in C} \left( \frac{|K|}{|K(s^c)|} \right)^{\alpha^*(s)} \le |K| \le \prod_{s \in C} |K(s)|^{\alpha^*(s)}.$$

The proof follows from Proposition II via the fact that any positive definite $n \times n$ matrix $K$ can be realized as the covariance matrix of a multivariate normal distribution $N(0, K)$, whose entropy is

$$h(X_{[n]}) = \frac{1}{2} \log \left[ (2\pi e)^n |K| \right],$$

and furthermore, that if $X_{[n]} \sim N(0, K)$, then $X_s \sim N(0, K(s))$. Note that an alternative approach to proving Corollary III would be to directly apply Theorem I to the known fact (called the Koteljanskii or sometimes the Hadamard-Fischer inequality) that the set function $f(s) = \log |K(s)|$ is submodular.

For an $r$-regular hypergraph $C$, using the degree partition in Corollary III implies that

$$|K|^r \le \prod_{s \in C} |K(s)|.$$

Considering the hypergraphs $C_1$ and $C_{n-1}$ then yields the Hadamard and prototypical Szasz inequality, while the Fischer inequality follows by considering $C = \{s, s^c\}$, for an arbitrary $s \subset [n]$.
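The regular-hypergraph inequality $|K|^r \le \prod_{s \in C} |K(s)|$ is easy to test numerically for the collections $C_k$ of all $k$-subsets, where $r = \binom{n-1}{k-1}$. The sketch below is only illustrative: the matrix `K` is an arbitrary positive definite example, and the helpers `det`, `principal` and `regular_inequality` are hypothetical names, with the determinant computed by naive cofactor expansion (adequate for small $n$ only).

```python
from itertools import combinations
from math import comb, prod


def det(m):
    """Determinant by Laplace expansion along the first row (fine for small n)."""
    if not m:
        return 1
    if len(m) == 1:
        return m[0][0]
    return sum(
        (-1) ** j * a * det([row[:j] + row[j + 1:] for row in m[1:]])
        for j, a in enumerate(m[0])
    )


def principal(K, s):
    """Principal submatrix K(s) with rows and columns indexed by s."""
    return [[K[i][j] for j in s] for i in s]


def regular_inequality(K, k):
    """Return (|K|^r, product of |K(s)| over all k-subsets s), with r = C(n-1, k-1)."""
    n = len(K)
    r = comb(n - 1, k - 1)
    lhs = det(K) ** r
    rhs = prod(det(principal(K, s)) for s in combinations(range(n), k))
    return lhs, rhs


# An arbitrary symmetric positive definite matrix (leading minors 4, 11, 43).
K = [[4, 1, 2],
     [1, 3, 0],
     [2, 0, 5]]

for k in (1, 2, 3):
    lhs, rhs = regular_inequality(K, k)
    print(f"k={k}: {lhs} <= {rhs}")
```

Here $k = 1$ is the Hadamard inequality ($43 \le 60$) and $k = n - 1 = 2$ is the prototypical Szasz inequality ($43^2 = 1849 \le 2640$); $k = n$ gives equality.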

We remark that one can interpret Corollary III using the all- 
minors matrix-tree theorem (see, e.g., Chaiken [7] or Lewin 
[32]). This is a generalization of the matrix tree theorem 
of Kirchhoff [29], which states that the determinant of any 
cofactor of the Laplacian matrix of a graph is the total number 
of distinct spanning trees in the graph, and interprets all minors 
of this matrix in terms of combinatorial properties of the graph. 

VII. Duality and Monotonicity of Gaps 
Consider the weak fractional form of Theorem I, namely

$$\sum_{s \in C} \gamma(s) f(s \mid s^c) \le f([n]) \le \sum_{s \in C} \gamma(s) f(s).$$

We observe that there is a duality between the upper and lower bounds, relating the gaps in this inequality.

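Theorem IV: [Duality of Gaps] Let $f : 2^{[n]} \to \mathbb{R}$ be a submodular function with $f(\emptyset) = 0$. Let $\gamma$ be an arbitrary fractional partition using some hypergraph $C$ on $[n]$. Define the lower and upper gaps by

$$\begin{aligned}
\mathrm{Gap}_L(f, C, \gamma) &= f([n]) - \sum_{s \in C} \gamma(s) f(s \mid s^c) \\
\text{and} \quad \mathrm{Gap}_U(f, C, \gamma) &= \sum_{s \in C} \gamma(s) f(s) - f([n]).
\end{aligned} \tag{12}$$

Then

$$\frac{\mathrm{Gap}_U(f, C, \gamma)}{w(\gamma)} = \frac{\mathrm{Gap}_L(f, \bar{C}, \bar{\gamma})}{w(\bar{\gamma})}, \tag{13}$$

where $w$ is the weight function and $\bar{\gamma}$ is the dual fractional partition defined in Section II.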

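Proof: This follows easily from the definitions. Indeed,

$$\begin{aligned}
\mathrm{Gap}_L(f, \bar{C}, \bar{\gamma})
&= f([n]) - \sum_{s \in C} \bar{\gamma}(s^c) f(s^c \mid s) \\
&= f([n]) - \sum_{s \in C} \frac{\gamma(s)}{w(\gamma) - 1} \big[ f([n]) - f(s) \big] \\
&= \frac{1}{w(\gamma) - 1} \left[ \sum_{s \in C} \gamma(s) f(s) - f([n]) \right] \\
&= \frac{\mathrm{Gap}_U(f, C, \gamma)}{w(\gamma) - 1},
\end{aligned}$$

and

$$w(\bar{\gamma}) = \sum_{s^c \in \bar{C}} \bar{\gamma}(s^c) = \sum_{s \in C} \frac{\gamma(s)}{w(\gamma) - 1} = \frac{w(\gamma)}{w(\gamma) - 1}.$$

Dividing the first expression by the second yields the result. ■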



Note that the upper bound for $f([n])$ with respect to $(C, \gamma)$ is equivalent to the lower bound for $f([n])$ with respect to the dual $(\bar{C}, \bar{\gamma})$, implying that the collection of upper bounds for all hypergraphs and all fractional coverings is equivalent to the collection of lower bounds for all hypergraphs and all fractional packings. Also, it is clear that under the assumptions of Corollary I, one can state a duality result extending Theorem IV by replacing $\gamma$ by any fractional covering $\alpha$, and $\bar{\gamma}$ by the dual fractional packing $\bar{\alpha}$.

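From Theorem IV, it is clear by symmetry that also

$$\frac{\mathrm{Gap}_L(f, C, \gamma)}{w(\gamma)} = \frac{\mathrm{Gap}_U(f, \bar{C}, \bar{\gamma})}{w(\bar{\gamma})}. \tag{14}$$

However, the identities (13) and (14) do not imply any relation between $\mathrm{Gap}_U(f, C, \gamma)$ and $\mathrm{Gap}_L(f, C, \gamma)$.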



The gaps in the inequalities have especially nice structure when they are considered in the weak degree form, i.e., for the fractional partition using an $r$-regular hypergraph $C$, all of whose coefficients are $1/r$. The associated gaps are

$$\begin{aligned}
g_L(f, C) &= f([n]) - \frac{1}{r} \sum_{s \in C} f(s \mid s^c) \\
\text{and} \quad g_U(f, C) &= \frac{1}{r} \sum_{s \in C} f(s) - f([n]).
\end{aligned} \tag{15}$$



Corollary IV: [Duality for Regular Collections] Let $f : 2^{[n]} \to \mathbb{R}$ be a submodular function with $f(\emptyset) = 0$. For an $r$-regular collection $C$,

$$\frac{g_L(f, \bar{C})}{g_U(f, C)} = \frac{r}{|C| - r}.$$



Let us now specialize to the entropy set function $e(s)$; we use this to mean either $H(X_s)$ (if the random variables $X_i$ are discrete) or $h(X_s)$ (if the random variables $X_i$ are continuous). The special hypergraphs $C_k$, $k = 1, 2, \ldots, n$, consisting of all $k$-sets or sets of size $k$, are of particular interest, and a lot is already known about the gaps for these collections. For instance, Han's inequality [23] already implies Proposition I for these hypergraphs, and Corollary IV applied to these hypergraphs implies that

$$\frac{g_L(e, C_{n-k})}{g_U(e, C_k)} = \frac{k}{n-k},$$

recovering an observation made by Fujishige [17]. Indeed, Theorem IV and Corollary IV generalize what [17] interpreted using the duality of polymatroids, since our assumptions are weaker and the assertions broader. Fujishige [17] considered these gaps important enough to merit a name: building on terminology of Han [23], he called the quantity $g_U(e, C_k)$ a "total correlation", and $g_L(e, C_k)$ a "dual total correlation". In two particular cases, the gaps have simple expressions as relative entropies (see Section IX for definitions). First, note that the lower gap in Han's inequality (3) is related to the dependence measure that generalizes the mutual information:

$$\begin{aligned}
(n-1)\, g_L(e, C_{n-1}) &= g_U(e, C_1) \\
&= \sum_{i \in [n]} e(i) - e([n]) \\
&= D(P_{X_{[n]}} \,\|\, P_{X_1} \times \cdots \times P_{X_n}).
\end{aligned} \tag{16}$$

It is trivial to see that the gap is zero if and only if the random variables are independent.

Second, the lower gap in Proposition I with respect to the singleton class $C_1$ is related to the upper gap in the prototypical form of Han's inequality:

$$\begin{aligned}
g_L(e, C_1) &= (n-1)\, g_U(e, C_{n-1}) \\
&= \sum_{i \in [n]} D(P_{X_i \mid X_{[n] \setminus i}} \,\|\, P_{X_i \mid X_{<i}} \mid P).
\end{aligned} \tag{17}$$

(Here the last equality comes from simple manipulation of the pointwise log likelihoods.) Note that for the gap to be zero, each of the relative entropies on the right must be zero. In particular, $D(P_{X_1 \mid X_{[2:n]}} \,\|\, P_{X_1}) = 0$, which implies that $X_1$ is independent of the remaining random variables. By applying the same fact to the collection of random variables under different orderings, one sees that $X_{[n]}$ must be an independent collection of random variables.

The latter observation is relevant to the study of the erasure entropy of a collection of random variables, defined by Verdú and Weissman [50] to be

$$H^-(X_{[n]}) = \sum_{i=1}^{n} H(X_i \mid X_{[n] \setminus i}).$$

They give several motivations for defining these quantities; most significantly, the erasure entropy has an operational significance as the number of bits required to reconstruct a symbol erased by an erasure channel. Theorem 1 in [50] states that $H^-(X_{[n]}) \le H(X_{[n]})$ with equality if and only if the $X_i$ are independent. The inequality here is simply the lower bound of Proposition I applied to the singleton class $C_1$, and is thus a special case of our results. The difference between the joint entropy of $X_{[n]}$ and its erasure entropy is just $g_L(e, C_1)$, and the characterization of equality in terms of independence follows from the remarks above. It would be interesting to see if the more general bounds on joint entropy developed here can also be given an operational meaning using appropriate erasure-type channels.

Apart from the eponymous duality between the total and dual total correlations discussed above, these quantities also satisfy a monotonicity property, sometimes called Han's theorem (cf., [23]). Since this complements the duality result, we state it below in the more general submodular function setting.

Corollary V: [Monotonicity of Gaps] Let $f : 2^{[n]} \to \mathbb{R}$ be a submodular function with $f(\emptyset) = 0$, and let $g_L(f, C_k)$ and $g_U(f, C_k)$ be defined by (15). Then both $g_L(f, C_k)$ and $g_U(f, C_k)$ are monotonically decreasing in $k$.

Proof: Proposition I, applied to the collection $C_k$, immediately implies that $0 = g_U(f, C_n) \le g_U(f, C_k)$, for $k \in [n]$, on observing that $r_-(s) = r_+(s) = \binom{n-1}{k-1}$. To obtain the full chain of inequalities, first note that for any $s$ in $C_{k+1}$,

$$f(s) \le \frac{1}{k} \sum_{i \in s} f(s \setminus i).$$

Thus

$$\begin{aligned}
g_U(f, C_k) - g_U(f, C_{k+1})
&= \frac{1}{\binom{n-1}{k-1}} \sum_{s \in C_k} f(s) - \frac{1}{\binom{n-1}{k}} \sum_{s \in C_{k+1}} f(s) \\
&\ge \frac{1}{\binom{n-1}{k-1}} \sum_{s \in C_k} f(s) - \frac{1}{k \binom{n-1}{k}} \sum_{s \in C_{k+1}} \sum_{i \in s} f(s \setminus i).
\end{aligned}$$

To complete the proof, note that

$$\sum_{s \in C_{k+1}} \sum_{i \in s} f(s \setminus i) = \sum_{i \in [n]} \sum_{s \in C_k,\, i \notin s} f(s) = (n-k) \sum_{s \in C_k} f(s).$$

Since $k \binom{n-1}{k} = (n-k) \binom{n-1}{k-1}$, the right side of the preceding inequality is zero. ■

Again specializing to the joint entropy function, let

$$e_k^{(U)} = \frac{1}{\binom{n}{k}} \sum_{s : |s| = k} \frac{e(s)}{k}$$

denote the joint entropy per element for subsets of size $k$ averaged over all $k$-element subsets, and

$$e_k^{(L)} = \frac{1}{\binom{n}{k}} \sum_{s : |s| = k} \frac{e(s \mid s^c)}{k}$$

denote the corresponding average of conditional entropy per element. Since $g_U(e, C_k) = n e_k^{(U)} - e([n])$ and $g_L(e, C_k) = e([n]) - n e_k^{(L)}$, Corollary V asserts that $e_k^{(U)}$ is decreasing in $k$, while $e_k^{(L)}$ is increasing in $k$. Dembo, Cover and Thomas [11] give a nice interpretation of this fact, briefly outlined below.

Suppose we have n sensors collecting data relevant to the 
task at hand. For instance, the sensors might be measuring the 
temperature of the ocean at various points, or they might be 
evaluating the probability that a human face is in a collection 
of camera images taken along the boundary of a high-security 
site, or they might be taking measurements of neurons in a 
monkey's brain. Suppose due to experimental conditions, at 
any time, we only have access to a random subset of m sensor 
measurements out of n. Then Han's monotonicity theorem 
implies that, on average, we are getting more information as 
m increases, etc. 
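The duality and monotonicity statements of this section can be verified numerically for a small discrete distribution. The sketch below is an illustration under stated assumptions: the pmf (two fair bits and their XOR), the 0-based coordinates, and the helper names `marginal_entropy` and `gaps` are all hypothetical choices, and entropies are in bits. It computes $g_L(e, C_k)$ and $g_U(e, C_k)$ from (15) for the collection of all $k$-subsets.

```python
from itertools import combinations, product
from math import comb, log2


def marginal_entropy(pmf, s):
    """Entropy (bits) of the coordinates in s under a joint pmf {tuple: prob}."""
    marg = {}
    for x, p in pmf.items():
        key = tuple(x[i] for i in s)
        marg[key] = marg.get(key, 0.0) + p
    return -sum(p * log2(p) for p in marg.values() if p > 0)


def gaps(pmf, n, k):
    """Return (g_L, g_U) of (15) for C_k, the collection of all k-subsets of [n]."""
    full = tuple(range(n))
    e_full = marginal_entropy(pmf, full)
    r = comb(n - 1, k - 1)          # degree of every index in C_k
    upper = lower = 0.0
    for s in combinations(full, k):
        s_c = tuple(i for i in full if i not in s)
        upper += marginal_entropy(pmf, s)
        lower += e_full - marginal_entropy(pmf, s_c)   # e(s | s^c)
    return e_full - lower / r, upper / r - e_full


# X1, X2 independent fair bits and X3 = X1 XOR X2.
n = 3
pmf = {(a, b, a ^ b): 0.25 for a, b in product((0, 1), repeat=2)}

for k in range(1, n + 1):
    g_l, g_u = gaps(pmf, n, k)
    print(f"k={k}: g_L={g_l:.3f}, g_U={g_u:.3f}")
```

For this example the pairs $(g_L, g_U)$ are $(2, 1)$, $(0.5, 1)$ and $(0, 0)$ for $k = 1, 2, 3$, so both gaps are non-increasing in $k$, and $g_L(e, C_{n-k}) / g_U(e, C_k) = k/(n-k)$ holds for $k = 1, 2$.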

VIII. Entropy Power Inequalities 

Theorem V implies similar inequalities for entropy powers. 
Recall that the entropy power of the random vector $X_s$ is

$$\mathcal{N}(X_s) = e^{2h(X_s)/|s|}.$$

This is sometimes standardized by a constant ($2\pi e$), which is convenient in the continuous case as it allows for a comparison with a multivariate normal distribution. For discrete random variables, one can replace $h$ by $H$ in the above definition.

Corollary VI: Let $\gamma$ be any fractional partition of $[n]$ using the hypergraph $C$. Then

$$\mathcal{N}(X_{[n]}) \le \sum_{s \in C} w_s\, \mathcal{N}(X_s),$$

where $w_s = \gamma(s)|s|/n$ are weights that sum to 1 over $s \in C$.



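Proof: First note that

$$\sum_{s \in C} w_s = \sum_{s \in C} \frac{\gamma(s)|s|}{n} = \frac{1}{n} \sum_{i \in [n]} \sum_{s \in C,\, s \ni i} \gamma(s) = 1,$$

since $\gamma$ is a fractional partition. Thus

$$\begin{aligned}
\mathcal{N}(X_{[n]}) &= \exp\left( \frac{2h(X_{[n]})}{n} \right) \\
&\le \exp\left( \frac{2}{n} \sum_{s \in C} \gamma(s)\, h(X_s) \right) \\
&= \exp\left( \sum_{s \in C} w_s\, \frac{2h(X_s)}{|s|} \right) \\
&\le \sum_{s \in C} w_s\, \mathcal{N}(X_s),
\end{aligned}$$

where the first inequality follows from Proposition II, and the last inequality follows by Jensen's inequality. ■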
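For a Gaussian vector the entropy power has a closed form, which makes Corollary VI easy to check numerically. The following sketch assumes $X \sim N(0, K)$ with an arbitrary positive definite `K`, uses the formula $h(X_s) = \frac{1}{2}\log[(2\pi e)^{|s|}|K(s)|]$ from Section VI (in nats), and takes the fractional partition with $\gamma(s) = 1/2$ on all 2-subsets of $[3]$. The helper names are hypothetical.

```python
from itertools import combinations
from math import e, exp, log, pi


def det(m):
    """Determinant by Laplace expansion along the first row (fine for small n)."""
    if len(m) == 1:
        return m[0][0]
    return sum(
        (-1) ** j * a * det([row[:j] + row[j + 1:] for row in m[1:]])
        for j, a in enumerate(m[0])
    )


def entropy_power(K, s):
    """N(X_s) = exp(2 h(X_s) / |s|) for X ~ N(0, K), with h in nats."""
    sub = [[K[i][j] for j in s] for i in s]
    h = 0.5 * log((2 * pi * e) ** len(s) * det(sub))
    return exp(2.0 * h / len(s))


def corollary_vi_sides(K, gamma):
    """Return (N(X_[n]), sum_s w_s N(X_s), sum_s w_s) with w_s = gamma(s)|s|/n."""
    n = len(K)
    weights = {s: g * len(s) / n for s, g in gamma.items()}
    lhs = entropy_power(K, tuple(range(n)))
    rhs = sum(w * entropy_power(K, s) for s, w in weights.items())
    return lhs, rhs, sum(weights.values())


K = [[4, 1, 2],
     [1, 3, 0],
     [2, 0, 5]]
# Each index lies in two of the three pairs, so gamma = 1/2 is a fractional partition.
gamma = {s: 0.5 for s in combinations(range(3), 2)}

lhs, rhs, total_weight = corollary_vi_sides(K, gamma)
print(lhs, "<=", rhs, "(weights sum to", total_weight, ")")
```

In the Gaussian case the inequality reduces to $|K|^{1/n} \le \sum_s w_s |K(s)|^{1/|s|}$, since the common factor $2\pi e$ cancels.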

Remark 8: Corollary VI generalizes an implication of Theorem 16.5.2 of Cover and Thomas [10], which looks at the collections of $k$-sets. Note that, as in the special case covered in [10], Corollary VI continues to hold with the entropy power $\mathcal{N} = \mathcal{N}_2$ replaced throughout by any of the quantities $\mathcal{N}_c(X_s) = \exp\{c\, h(X_s)/|s|\}$ for any $c > 0$. As in the case of entropy, the bounds on the entropy powers associated with the hypergraphs $C_m$ and the degree covering satisfy a monotonicity property. Indeed, by Theorem 16.5.2 of [10],

$$\frac{1}{\binom{n}{m}} \sum_{s \in C_m} \mathcal{N}(X_s)$$

is a decreasing sequence in $m$.

More interesting than entropy power inequalities for joint distributions, however, are entropy power inequalities for sums of independent random variables with densities. Introduced by Shannon [45] and Stam [48] in seminal contributions, they have proved to be extremely useful and surprisingly deep, with connections to functional analysis, central limit theorems, and to the determination of capacity and rate regions for problems in information theory. Recently the first author showed (building on work by Artstein, Ball, Barthe and Naor [2] and Madiman and Barron [33]) the following generalized entropy power inequality. For independent real-valued random variables $X_i$ with densities and finite variances,

$$\mathcal{N}\Big( \sum_{i \in [n]} X_i \Big) \ge \sum_{s \in C} \gamma(s)\, \mathcal{N}\Big( \sum_{i \in s} X_i \Big) \tag{18}$$

for any fractional partition $\gamma$ with respect to any hypergraph $C$ on $[n]$. Inequality (18) shares an intriguing similarity of form to the inequalities of this paper, although it is much harder to prove.

The formal similarity between results for joint entropy and for entropy power of sums extends further. For instance, the fact that

$$\frac{1}{\binom{n}{m}} \sum_{s \in C_m} \mathcal{N}\Big( \sum_{i \in s} X_i \Big)$$

is an increasing sequence in $m$, can be thought of as a formal dual of Han's theorem. It is an open question whether upper bounds for entropy power of sums can be obtained that are analogous to the lower bound in Theorem I'.



IX. An Inequality for Relative Entropy, and 
Interpretations 

Let $A$ be either a countable set, or a Polish (i.e., complete separable metric) space equipped as usual with its Borel $\sigma$-algebra of measurable sets. Let $P$ and $Q$ be probability measures on the Polish product space $A^n$. For any nonempty subset $s$ of $[n]$, write $P_s$ for the marginal probability measure corresponding to the coordinates in $s$. Recall the definition of the relative entropy:

$$D(P_s \| Q_s) = E_P \left[ \log \frac{dP_s}{dQ_s} \right] \in [0, \infty]$$

when $P_s$ is absolutely continuous with respect to $Q_s$, and $D(P_s \| Q_s) = +\infty$ otherwise.

One may also define the conditional relative entropy by

$$D(P_{s|t} \| Q_{s|t} \mid P) = E_{P_t} D(P_{s|t} \| Q_{s|t}), \tag{19}$$



where $P_{s|t}$ is understood to mean the conditional distribution (under $P$) of the random variables corresponding to $s$ given particular values of the random variables corresponding to $t$; then $E_{P_t}$ denotes the averaging using $P_t$ over the values that are conditioned on. With this definition, and writing $d(s) = D(P_s \| Q_s)$, it is easy to verify the chain rule

$$d(s \cup t) = D(P_{s|t} \| Q_{s|t} \mid P) + d(t)$$

for disjoint $s$ and $t$, so that following the terminology developed in Section III we have

$$d(s \mid t) = D(P_{s|t} \| Q_{s|t} \mid P).$$

We have freely used (regular) conditional distributions in these definitions; the existence of these is justified by the fact that we are working with Polish spaces.

Theorem V: Let $Q$ be a product probability measure on $A^n$, where $A$ is a Polish space as above. Suppose $P$ is a probability measure on $A^n$ such that the set function $d : 2^{[n]} \to [0, \infty]$ given by

$$d(s) = D(P_s \| Q_s)$$

does not take the value $+\infty$ for any $s \subset [n]$. Then $d(s)$ is supermodular.

Proof: For any nonempty $s, t \subset [n]$, we have

$$\begin{aligned}
d(s \cup t) + d(s \cap t) - d(s) - d(t)
&= [d(s \cup t) - d(t)] - [d(s) - d(s \cap t)] \\
&= d(s \cup t \setminus t \mid t) - d(s \setminus s \cap t \mid s \cap t).
\end{aligned}$$

Since $s \cup t \setminus t = s \setminus s \cap t$, it would suffice to prove for disjoint sets $s'$ and $t$ that

$$d(s' \mid t) \ge d(s' \mid t') \tag{20}$$

for any $t' \subset t$. However observe that, since $Q$ is a product probability measure,

$$d(s' \mid t) = E_{P_t} D(P_{s'|t} \| Q_{s'}) = E_{P_{t'}} E_{P_{t \setminus t' | t'}} D(P_{s'|t} \| Q_{s'})$$

and

$$d(s' \mid t') = E_{P_{t'}} D(P_{s'|t'} \| Q_{s'}) = E_{P_{t'}} D\big( E_{P_{t \setminus t' | t'}} P_{s'|t} \,\big\|\, Q_{s'} \big),$$

so that (20) is an immediate consequence of the convexity of relative entropy (see, e.g., [10]). ■
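The supermodularity asserted by Theorem V can be confirmed by exhaustive checking on a small discrete example. The sketch below is an illustration only: the alphabet is $\{0,1\}$, `P` is an arbitrary joint pmf on three bits, `Q` is a product of Bernoulli measures with full support, and the helper names (`marginal`, `d`, `is_supermodular`) are hypothetical. Relative entropies are in bits.

```python
from itertools import combinations, product
from math import log2


def marginal(pmf, s):
    """Marginal pmf of the coordinates in s."""
    out = {}
    for x, p in pmf.items():
        key = tuple(x[i] for i in s)
        out[key] = out.get(key, 0.0) + p
    return out


def d(P, Q, s):
    """Set function d(s) = D(P_s || Q_s), in bits; assumes Q has full support."""
    P_s, Q_s = marginal(P, s), marginal(Q, s)
    return sum(p * log2(p / Q_s[k]) for k, p in P_s.items() if p > 0)


def is_supermodular(P, Q, n, tol=1e-9):
    """Check d(s u t) + d(s n t) >= d(s) + d(t) over all pairs of subsets of [n]."""
    subsets = [frozenset(c) for k in range(n + 1) for c in combinations(range(n), k)]
    for s in subsets:
        for t in subsets:
            lhs = d(P, Q, sorted(s | t)) + d(P, Q, sorted(s & t))
            rhs = d(P, Q, sorted(s)) + d(P, Q, sorted(t))
            if lhs < rhs - tol:
                return False
    return True


n = 3
outcomes = list(product((0, 1), repeat=n))
# An arbitrary (non-product) joint pmf P: weights sum to 16.
P = {x: w / 16.0 for x, w in zip(outcomes, [4, 1, 1, 2, 1, 2, 2, 3])}
# A product measure Q with Bernoulli marginals 0.3, 0.5, 0.8.
q = (0.3, 0.5, 0.8)
Q = {x: (q[0] if x[0] else 1 - q[0]) * (q[1] if x[1] else 1 - q[1])
        * (q[2] if x[2] else 1 - q[2]) for x in outcomes}

print("d is supermodular:", is_supermodular(P, Q, n))
```

The product structure of `Q` is what the theorem requires; the check is not expected to pass for an arbitrary non-product reference measure.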

Based on the supermodularity proved in Theorem V, The- 
orem I applied to —d(s) immediately implies the following 
corollary. 

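Corollary VII: Under the assumptions of Theorem V,

$$\begin{aligned}
\sum_{s \in C} \gamma(s)\, D(P_{s \mid s^c \setminus >s} \,\|\, Q_s \mid P) &\ge D(P_{[n]} \| Q_{[n]}) \\
&\ge \sum_{s \in C} \gamma(s)\, D(P_{s \mid <s} \,\|\, Q_s \mid P),
\end{aligned} \tag{21}$$

where $\gamma$ is any fractional partition using any hypergraph $C$ on $[n]$.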

Remark 9: We mention a hypothesis testing interpretation for the following easier-to-parse corollary of Corollary VII: for $r$-regular hypergraphs $C$ on $[n]$,

$$D(P_{[n]} \| Q_{[n]}) \ge \frac{1}{r} \sum_{s \in C} D(P_s \| Q_s). \tag{22}$$

Suppose $P$ and $Q$ are two competing hypotheses for the joint distribution of $X_{[n]}$. Then it is a classical fact due to Chernoff (see, e.g., Cover and Thomas [10], where it is called Stein's lemma) that the best error exponent for a hypothesis test between $P$ and $Q$ based on a large number of i.i.d. observations of the random vector $X_{[n]}$ is given by $D(P_{[n]} \| Q_{[n]})$. One may ask the following question: If one has partial access to all observations (for instance, one observes only $X_s$ out of each $X_{[n]}$), then how much is our capacity to distinguish between the two hypotheses $P$ and $Q$ worsened? Corollary VII can be interpreted as giving us estimates that relate our capacity to distinguish between the two hypotheses given all the data to our capacity to distinguish between the two hypotheses given various subsets of the data.

Interestingly, Corollary VII implies a tensorization property of the entropy functional $\mathrm{Ent}_Q(f) = E_Q[f \log f] - (E_Q f) \log(E_Q f)$, defined for positive functions $f$. From the special case of Corollary VII corresponding to Han's inequality (i.e., the hypergraph $C_{n-1}$), one obtains the classical tensorization property, as noticed by Massart [38]. We present below a generalized tensorization inequality for the entropy functional with respect to a product measure by utilizing the power of Corollary VII more fully.

Corollary VIII: Let $C$ be an $r$-regular hypergraph on $[n]$. Then

$$\mathrm{Ent}_{Q_{[n]}}(g) \le \frac{1}{r}\, E_Q \sum_{s \in C} \mathrm{Ent}_{Q_s}(g).$$

We omit the proof, which is based on the observation that $\mathrm{Ent}_Q(f) = (E_Q f)\, D(P \| Q)$, where $P$ is the probability measure such that $\frac{dP}{dQ} = \frac{f}{E_Q f}$, and follows the same line of argument as in [38].



The tensorization property of the entropy functional is 
of enormous utility in functional analysis, and the study of 
isoperimetry, concentration of measure, and convergence of 
Markov processes to stationarity. For instance, see Gross [22], 
Bobkov and Ledoux [4], and Kontoyiannis and Madiman [30], 
where the classical tensorization property is used to prove 
logarithmic Sobolev inequalities for Gaussian, Poisson and 
compound Poisson distributions respectively. 

X. Historical Remarks 

It turns out that the main technical result of this paper, 
Theorem I, is related to a wide body of work in a number of 
fields, including the study of combinatorial optimization of set 
functions in computer science, the study of cooperative games 
in economics, the study of capacities in probability theory, 
and of course the study of structural properties of entropy in 
information theory, which has been our present focus. In this 
section, we sketch these connections and place our work in 
context. 

The following terminology is useful. 

Definition IX: The set function $f$ is fractionally subadditive if

$$f([n]) \le \sum_{s \in C} \gamma(s) f(s) \tag{23}$$

for any $C \subset 2^{[n]}$ and for any fractional partition $\gamma : C \to \mathbb{R}_+$ of $[n]$. If the inequality is reversed, we say $f$ is fractionally superadditive.

Note that Theorem I has the following corollary (basically Proposition II for general submodular functions), obtained by weakening the upper bound in Theorem I.

Corollary IX: If $f$ is submodular and $f(\emptyset) = 0$, then it is fractionally subadditive.
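As a concrete instance of Corollary IX, consider a coverage function $f(s) = |\bigcup_{i \in s} A_i|$, which is a standard example of a submodular function with $f(\emptyset) = 0$. The sketch below checks (23) for one fractional partition. The sets `A`, the weights, and the helper names are hypothetical and chosen purely for illustration; exact rational arithmetic is used so that the partition condition can be tested with equality.

```python
from fractions import Fraction
from itertools import combinations


def coverage(A, s):
    """Coverage function f(s): size of the union of the sets A[i] for i in s."""
    covered = set()
    for i in s:
        covered |= A[i]
    return len(covered)


def is_fractional_partition(n, gamma):
    """Check that the weights of the sets containing each index i sum to exactly 1."""
    return all(sum(w for s, w in gamma.items() if i in s) == 1 for i in range(n))


def fractional_bound(A, gamma):
    """Right side of (23): sum over s of gamma(s) * f(s)."""
    return sum(w * coverage(A, s) for s, w in gamma.items())


A = [{1, 2}, {2, 3}, {3, 4, 5}]
n = len(A)
# gamma = 1/2 on every pair: each index lies in exactly two pairs.
gamma = {s: Fraction(1, 2) for s in combinations(range(n), 2)}

print(is_fractional_partition(n, gamma))
print(coverage(A, range(n)), "<=", fractional_bound(A, gamma))
```

Here $f([n]) = 5$, while the three pairs cover 3, 5 and 4 elements, giving a bound of $(3 + 5 + 4)/2 = 6$.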

This result has a long history, and has rarely been explicitly 
stated in the literature although aspects of it have been 
rediscovered on multiple occasions in various fields. First we 
describe how it is implicit in the classical theory of cooperative 
games. 

In cooperative game theory, a set function $f : 2^{[n]} \to \mathbb{R}_+$ is called a value function; it can be thought of as describing the payoff that can be obtained by arbitrary coalitions of $n$ players, and it is canonical to take $f(\emptyset) = 0$. Different assumptions on the value function $f$ correspond to different kinds of games. For instance, a balanced game is one for which the value function is fractionally superadditive, i.e.,

$$f([n]) \ge \sum_{s \in C} \gamma(s) f(s) \tag{24}$$

holds for every fractional partition $\gamma$. If the value function $f$ is supermodular, the corresponding game is said to be a convex game.

One solution concept for cooperative games is the core, a 
subset of Euclidean space representing possible allocations of 
the payoff to players. (We do not bother to define it here; it 
suffices for our brief remarks here to know that such a notion 
exists.) The fundamental Bondareva-Shapley theorem [5], [46] 
states that the game with transferable utility associated with 
the value function / has a non-empty core if and only if it 
is balanced. Separately, it is known from even earlier work 
of Kelley [28] (see also Shapley [47] who rediscovered it in 
the language of games) that a convex game has a non-empty 
core. Putting these together, one sees that a convex game must 
be balanced. This yields a statement very similar to that of 
Corollary IX. 

Much more recently, yet another related approach to the re- 
lationship between submodularity and fractional subadditivity 
has come from the theory of combinatorial auctions. Lehmann, 
Lehmann and Nisan [31] showed that every submodular func- 
tion is "XOS" (terminology that again we do not bother to 
explain here). Feige [14] showed that XOS and fractionally 
subadditive are identical. We refer the reader to the mentioned 
papers for definitions and details. 

To summarize, the literature from cooperative game theory 
and from combinatorial auction theory each implies Corollary IX. 

While we had expected direct proofs of Corollary IX to 
exist in the literature, we had initially been unable to find a 
reference. After the first version of this paper was written and 
presented at various venues, we were informed by Alan Sokal 
that it has indeed been explicitly stated and proved in the 
French statistical physics literature by Moulin Ollagnier and 
Pinchon [40] (see also van Enter, Fernandez and Sokal [49], 
where it is applied to entropy in a statistical physics context). 

The above discussion is also related to the theory of polymatroids. A nondecreasing and submodular set function $f : 2^{[n]} \to \mathbb{R}_+$ with $f(\emptyset) = 0$ is sometimes called a $\beta$-function. This class of functions has been intensely studied ever since the pioneering work of Edmonds [13], who used them to define polymatroids. Note that the nondecreasing property (i.e., $f(t) \le f(s)$ whenever $t \subset s \subset [n]$) implies that $f$ is non-negative. It is pertinent to note that the extra properties inherent in polymatroid theory are not required for Corollary IX and Theorem I (for instance, a non-negativity requirement for $f$ would rule out an application to the differential entropy); so Theorem I is really just a basic fact about submodular functions.

XI. Discussion 

The inequalities presented in this note are contributions 
to a large body of work on the structural properties of the 
entropy function for joint distributions. While the origins of 
such work clearly lie in Shannon's foundational paper, let us 
again mention (see also the discussion after Theorem I') that 
the important observation of submodularity of the joint entropy 
function goes back at least to Fujishige [17]. There have 
also been interesting new developments in the last few years, 
namely the discovery of the so-called "non-Shannon inequali- 
ties". Motivated by the goal of characterizing the possible joint 
entropy set functions $e(s) = H(X_s)$ for the discrete entropy as 
the underlying joint distribution is varied arbitrarily, Zhang and 
Yeung [51] revealed a fascinating phenomenon: if one thinks 
of each such $e$ (corresponding to any joint distribution on $n$ 
copies of a discrete alphabet) as being a vector of dimension 
$2^n$, then the set of vectors one obtains in this manner is a strict 
subset of the set of vectors corresponding to polymatroidal 
functions for any $n \ge 4$. The constraints on joint entropy 
that are not automatic consequences of a polymatroid property 
were termed "non-Shannon inequalities" in [51]. For more 
recent developments on this subject, one may consult Ibinson, 
Linden and Winter [24], Matus [39], or Dougherty, Freiling 
and Zeger [12]. 

In the context of these works, it is pertinent to note that all of the inequalities in this paper are Shannon inequalities, in the sense that they follow from submodularity of an entropy function. Indeed, our study was based on the set function $e(s) = H(X_s)$, from consideration of which our main entropy inequality (Theorem I') was derived. However, since we now know from the mentioned literature that entropy satisfies additional constraints beyond submodularity, a natural question arises. If it is true that the set function $\bar{e}(s) = H(X_s \mid X_{<s})$ is itself submodular, so that Theorem I' then follows by an application of Corollary IX to $\bar{e}$ rather than an application of Theorem I to $e$, then we would have a tighter outer bound on the space of joint entropy set functions. The following counterexample shows that this is not the case.

Proposition III: The set function $\bar{e}(s)$ is not submodular.

Proof: We construct a counterexample with $n = 4$ random variables. Consider the sets $s = \{1,3\}$ and $t = \{3,4\}$. Then $s \cup t = \{1,3,4\}$ and $s \cap t = \{3\}$. If $\bar{e}$ is submodular, then since $s$ contains the first element,

$$H(X_s) + H(X_t \mid X_{<t}) \ge H(X_{s \cup t}) + H(X_{s \cap t} \mid X_{<(s \cap t)}),$$

which in our case becomes

$$H(X_{\{1,3\}}) + H(X_{\{3,4\}} \mid X_{\{1,2\}}) \ge H(X_{\{1,3,4\}}) + H(X_{\{3\}} \mid X_{\{1,2\}}). \tag{25}$$

By the chain rule,

$$H(X_{\{1,3,4\}}) = H(X_{\{1,3\}}) + H(X_4 \mid X_{\{1,3\}}),$$

and

$$H(X_{\{3,4\}} \mid X_{\{1,2\}}) = H(X_4 \mid X_{\{1,2,3\}}) + H(X_3 \mid X_{\{1,2\}}),$$

so that (25) reduces to

$$H(X_{\{1,3\}}) + H(X_4 \mid X_{\{1,2,3\}}) + H(X_3 \mid X_{\{1,2\}}) \ge H(X_{\{1,3\}}) + H(X_4 \mid X_{\{1,3\}}) + H(X_3 \mid X_{\{1,2\}}),$$

and thence simply to $H(X_4 \mid X_{\{1,2,3\}}) \ge H(X_4 \mid X_{\{1,3\}})$. However, this is in general not true since conditioning reduces entropy, and thus the hypothesis of submodularity is falsified. ■

Note, however, that such a counterexample is only possible when $s \cup t$ is strictly smaller than the index set $[n]$.
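A concrete distribution makes the failure of submodularity explicit. The sketch below is an illustration under stated assumptions: $X_1, X_2, X_3$ are taken to be independent fair bits and $X_4 = X_2$ (a choice made here, not in the text), indices are 0-based so the paper's sets $\{1,3\}$ and $\{3,4\}$ become `(0, 2)` and `(2, 3)`, and $X_{<s}$ is read as the variables with index below $\min(s)$, which matches the conditioning sets used in the proof above. The helper names are hypothetical.

```python
from itertools import product
from math import log2


def entropy(pmf, s):
    """Entropy (bits) of the coordinates in s under a joint pmf {tuple: prob}."""
    marg = {}
    for x, p in pmf.items():
        key = tuple(x[i] for i in s)
        marg[key] = marg.get(key, 0.0) + p
    return -sum(p * log2(p) for p in marg.values() if p > 0)


def cond_entropy(pmf, s, t):
    """H(X_s | X_t) = H(X_{s u t}) - H(X_t)."""
    union = tuple(sorted(set(s) | set(t)))
    return entropy(pmf, union) - entropy(pmf, tuple(t))


def e_bar(pmf, s):
    """H(X_s | X_{<s}), with <s taken to be all indices below min(s)."""
    return cond_entropy(pmf, s, tuple(range(min(s))))


# X1, X2, X3 independent fair bits, X4 = X2 (0-based coordinates 0..3).
pmf = {(a, b, c, b): 0.125 for a, b, c in product((0, 1), repeat=3)}

s, t = (0, 2), (2, 3)
lhs = e_bar(pmf, s) + e_bar(pmf, t)                 # e_bar(s) + e_bar(t)
rhs = e_bar(pmf, (0, 2, 3)) + e_bar(pmf, (2,))      # e_bar(s u t) + e_bar(s n t)
print(lhs, "<", rhs)
```

Here $H(X_4 \mid X_{\{1,2,3\}}) = 0$ while $H(X_4 \mid X_{\{1,3\}}) = 1$ bit, so the left side of (25) equals 3 and the right side equals 4.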

The relationship between the inequalities for discrete and 
continuous entropy in this paper is worth noting. Observe that 
a slightly more general class of inequalities holds for discrete 
entropy as compared to differential entropy (for instance, only 
fractional partitions are allowed in the differential entropy 
context in Theorem I'); however, this is not surprising and 
indeed follows from the equivalences explored by Chan [8]. 

The structural properties of entropy discussed in this work 
are not just of abstract interest. Some applications, to de- 
terminant inequalities and counting problems, have already 
been mentioned in earlier sections. The inequalities discussed 
also have close connections with several classical multiuser 
information theoretic problems, including the Slepian-Wolf 
data compression problem and the multiple access channel. 
In particular, for the Slepian-Wolf problem where data from n 
sources is to be losslessly compressed in a distributed fashion, 
it is the set function $H(X_s \mid X_{s^c})$ rather than $H(X_s)$ that plays
the key role. Consequently, the lower bound in Theorem I'
has a crucial significance: it is equivalent to the existence of 
a rate point whose sum rate is the same as the rate achievable 
for non-distributed compression (namely $H(X_{[n]})$), and is
one way of showing that no extra cost is paid in terms of 
asymptotic rate for the distributed nature of the task. These 
connections merit a separate and more detailed exploration, 
and are discussed along with several other applications of 
cooperative game theory to multiuser problems in [36]. 
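
As a small illustration of the rate-point statement, the sketch below takes the standard description of the Slepian-Wolf region (rates $R_i$ with $\sum_{i \in s} R_i \geq H(X_s \mid X_{s^c})$ for every nonempty $s$) and checks that the chain-rule point $R_i = H(X_i \mid X_1, \ldots, X_{i-1})$ satisfies every constraint while having sum rate $H(X_{[n]})$. The choice of this particular rate point, the random binary distribution, and all names are assumptions made for illustration; the paper does not prescribe them.

```python
import itertools
import math
import random
from collections import defaultdict


def entropy(pmf, idx):
    """Joint entropy H(X_idx) in bits; idx holds 0-based coordinate positions."""
    marg = defaultdict(float)
    for outcome, p in pmf.items():
        marg[tuple(outcome[i] for i in idx)] += p
    return -sum(p * math.log2(p) for p in marg.values() if p > 0)


def cond_entropy(pmf, a, b):
    """Conditional entropy H(X_a | X_b) = H(X_{a u b}) - H(X_b)."""
    a, b = set(a), set(b)
    return entropy(pmf, sorted(a | b)) - entropy(pmf, sorted(b))


def random_pmf(n, seed=0):
    """Hypothetical helper: a random joint pmf on n binary variables."""
    rng = random.Random(seed)
    weights = {x: rng.random() for x in itertools.product((0, 1), repeat=n)}
    total = sum(weights.values())
    return {x: w / total for x, w in weights.items()}


n = 3
pmf = random_pmf(n)

# Chain-rule rate point: R_i = H(X_i | X_1, ..., X_{i-1}).
rates = [cond_entropy(pmf, [i], range(i)) for i in range(n)]
sum_rate = sum(rates)
joint = entropy(pmf, range(n))

# Collect every subset s whose constraint sum_{i in s} R_i >= H(X_s | X_{s^c})
# is violated (up to a small numerical tolerance).
violations = []
for k in range(1, n + 1):
    for s in itertools.combinations(range(n), k):
        complement = [i for i in range(n) if i not in s]
        if sum(rates[i] for i in s) < cond_entropy(pmf, s, complement) - 1e-12:
            violations.append(s)

print(f"sum rate = {sum_rate:.6f}, H(X_[n]) = {joint:.6f}")
print(f"violated constraints: {violations}")
```

No constraint is violated because each $R_i$ conditions on a subset of the variables appearing in the corresponding term of the chain-rule expansion of $H(X_s \mid X_{s^c})$, and conditioning reduces entropy.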

Chain rules for entropy and relative entropy have played an 
important role in information theory since their recognition 
by Shannon. Here we have presented several inequalities for 
information in joint distributions that go beyond the chain rules 
but can also be thought of as deeper consequences of them. 
While these relate the information in projections of a random 
vector onto different subspaces, more general inequalities 
can be formulated that apply to a rich class of functions 
beyond projections (such as the sum), and these are described 
along with applications to additive combinatorics and matrix 
analysis in the follow-up works [34], [37]. We anticipate
further extensions and applications of these inequalities in the
future.

Acknowledgment 

We are indebted to Andrew Barron for many useful discus- 
sions, and for the indirect influence of Andrew's joint work 
[33] with MM on entropy power inequalities. We thank the 
organizers of the IEEE International Symposium on Informa- 
tion Theory 2006 in Seattle where we met and initiated this 
work, and Ravindra Bapat, Uriel Feige, Gil Kalai and Alan 
Sokal for help with references. PT is thankful to the Theory 
Group at Microsoft Research for hosting him during the period 
this research was carried out. We are also deeply indebted 
to three anonymous referees for very thorough feedback that 
eliminated an error and significantly improved the paper. 

References 

[1] N. Alon and J. H. Spencer. The probabilistic method. Wiley-Interscience 
Series in Discrete Mathematics and Optimization. John Wiley & Sons, 
New York, second edition, 2000. 

[2] S. Artstein, K. M. Ball, F. Barthe, and A. Naor. Solution of Shannon's
problem on the monotonicity of entropy. J. Amer. Math. Soc.,
17(4):975-982 (electronic), 2004.

[3] R. Bellman. Introduction to Matrix Analysis. McGraw-Hill, 1960.

[4] S.G. Bobkov and M. Ledoux. On modified logarithmic Sobolev
inequalities for Bernoulli and Poisson measures. J. Funct. Anal.,
156(2):347-365, 1998.



[5] O. N. Bondareva. Some applications of the methods of linear
programming to the theory of cooperative games (in Russian). Problemy
Kibernetiki, 10:119-139, 1963.

[6] G. Brightwell and P. Tetali. The number of linear extensions of the 
boolean lattice. Order, 20:333-345, 2003. 

[7] S. Chaiken. A combinatorial proof of the all minors matrix tree theorem.
SIAM J. Algebraic Discrete Methods, 3(3):319-329, 1982.

[8] T. H. Chan. Balanced information inequalities. IEEE Trans. Inform. 
Theory, 49(12):3261-3267, 2003. 

[9] F.R.K. Chung, R.L. Graham, P. Frankl, and J.B. Shearer. Some
intersection theorems for ordered sets and graphs. J. Combinatorial
Theory, Ser. A, 43:23-37, 1986.

[10] T.M. Cover and J. A. Thomas. Elements of Information Theory. J. Wiley,
New York, 1991.

[11] A. Dembo, T.M. Cover, and J. A. Thomas. Information-theoretic
inequalities. IEEE Trans. Inform. Theory, 37(6):1501-1518, 1991.

[12] R. Dougherty, C. Freiling, and K. Zeger. Networks, matroids and 
non-Shannon information inequalities. IEEE Trans. Inform. Theory (to 
appear), 2007. 

[13] J. Edmonds. Submodular functions, matroids and certain polyhedra.
In Proc. International Conf. on Combinatorial Structures and their
applications. Gordon and Breach, 1970.

[14] U. Feige. On maximizing welfare when utility functions are subadditive.
Preprint, 2006.

[15] E. Friedgut. Hypergraphs, entropy, and inequalities. The American
Mathematical Monthly, 111(9):749-760, November 2004.

[16] E. Friedgut and J. Kahn. On the number of copies of one hypergraph
in another. Israel Journal of Mathematics, 105:251-256, 1998.

[17] S. Fujishige. Polymatroidal dependence structure of a set of random
variables. Information and Control, 39:55-72, 1978.

[18] S. Fujishige. Submodular functions and optimization, volume 58 of
Annals of Discrete Mathematics. Elsevier B. V., Amsterdam, second
edition, 2005.

[19] Z. Füredi. Scrambling permutations and entropy of hypergraphs.
Random Structures Algorithms, 8(2):97-104, 1996.

[20] D. Galvin. Personal communication. 2006.

[21] D. Galvin and P. Tetali. On weighted graph homomorphisms.
DIMACS-AMS Special Volume, 63:13-28, 2004.

[22] L. Gross. Logarithmic Sobolev inequalities. Amer. J. Math.,
97(4):1061-1083, 1975.

[23] Te Sun Han. Nonnegative entropy measures of multivariate symmetric
correlations. Information and Control, 36(2):133-156, 1978.

[24] B. Ibinson, N. Linden, and A. Winter. All inequalities for the relative 
entropy. Proc. IEEE Intl. Symp. Inform. Theory, Seattle, pages 237-241, 
2006. 

[25] J. Kahn. An entropy approach to the hard-core model on bipartite graphs.
Combinatorics, Probability and Computing, 10:219-237, 2001.

[26] J. Kahn. Entropy, independent sets and antichains: a new approach to
Dedekind's problem. Proc. Amer. Math. Soc., 130(2):371-378, 2001.

[27] J. Kahn. Personal communication. 2006.

[28] J. L. Kelley. Measures on Boolean algebras. Pacific J. Math.,
9:1165-1177, 1959.

[29] G. Kirchhoff. Über die Auflösung der Gleichungen, auf welche man bei
der Untersuchung der linearen Verteilung galvanischer Ströme geführt
wird. Ann. Phys. Chem., 72:497-508, 1847.

[30] I. Kontoyiannis and M. Madiman. Measure concentration for Compound
Poisson distributions. Elect. Comm. Probab., 11:45-57, 2006.

[31] B. Lehmann, D. Lehmann, and N. Nisan. Combinatorial auctions with
decreasing marginal utilities. In Proceedings of the 3rd ACM conference
on Electronic Commerce, Tampa, Florida, pages 18-28, 2001.

[32] M. Lewin. A generalization of the matrix-tree theorem. Math. Z.,
181(1):55-70, 1982.

[33] M. Madiman and A.R. Barron. Generalized entropy power inequalities
and monotonicity properties of information. IEEE Trans. Inform. Theory,
53(7), 2007.

[34] M. Madiman, A. Marcus, and P. Tetali. Entropy and set
cardinality inequalities for partition-determined functions, with
applications to sumsets. Preprint, 2008. (Available online:
http://arxiv.org/abs/0901.0055)

[35] M. Madiman and P. Tetali. Sandwich bounds for joint entropy. Proc. 
IEEE Intl. Symp. Inform. Theory, Nice, June 2007. 

[36] M. Madiman. Cores of cooperative games in information theory. 
EURASIP J. on Wireless Comm. and Networking, no. 318704, 2008. 

[37] M. Madiman. Determinant and trace inequalities for sums of positive- 
definite matrices. Preprint, 2008. 

[38] P. Massart. Some applications of concentration inequalities to statistics.
Annales de la Faculté des Sciences de Toulouse, IX(2):245-303, 2000.









[39] F. Matúš. Two constructions on limits of entropy functions. IEEE Trans.
Inform. Theory, 53(1):320-330, 2007.

[40] J. Moulin Ollagnier and D. Pinchon. Filtre moyennant et valeurs
moyennes des capacités invariantes. Bull. Soc. Math. France,
110(3):259-277, 1982.

[41] H. Narayanan. Submodular functions and electrical networks, volume 54
of Annals of Discrete Mathematics. North-Holland Publishing Co.,
Amsterdam, 1997.

[42] J. Nayak, E. Tuncel, and K. Rose. Zero-error source-channel coding with
side information. IEEE Trans. Inform. Th., 52:4626-4629, 2006.

[43] J. Radhakrishnan. Entropy and counting. In Computational Mathemat- 
ics, Modelling and Algorithms (ed. J. C. Misra), Narosa, 2003. 

[44] E. R. Scheinerman and D. H. Ullman. Fractional Graph Theory. Wiley, 
1997. 

[45] C.E. Shannon. A mathematical theory of communication. Bell System
Tech. J., 27:379-423, 623-656, 1948.



[46] L. S. Shapley. On balanced sets and cores. Naval Research Logistics
Quarterly, 14:453-460, 1967.

[47] L. S. Shapley. Cores of convex games. International Journal of Game
Theory, 1(1):11-26, 1971.

[48] A.J. Stam. Some inequalities satisfied by the quantities of information 
of Fisher and Shannon. Information and Control, 2:101-112, 1959. 

[49] A. C. D. van Enter, R. Fernandez, and A. D. Sokal. Regularity properties
and pathologies of position-space renormalization-group
transformations: scope and limitations of Gibbsian theory. J. Statist.
Phys., 72(5-6):879-1167, 1993.

[50] S. Verdú and T. Weissman. Erasure entropy. Proc. IEEE Intl. Symp.
Inform. Theory, Seattle, 2006.

[51] J. Zhang and R.W. Yeung. On characterization of entropy function via 
information inequalities. IEEE Trans. Inform. Th., 44:1440-1452, 1998. 



